Data management strategy for a collaborative research center

Abstract
The importance of effective research data management (RDM) strategies to support the generation of Findable, Accessible, Interoperable, and Reusable (FAIR) neuroscience data grows with each advance in data acquisition techniques and research methods. To maximize the impact of diverse research strategies, multidisciplinary, large-scale neuroscience research consortia face a number of unsolved challenges in RDM. While open science principles are largely accepted, it is practically difficult for researchers to prioritize RDM over other pressing demands. The implementation of a coherent, executable RDM plan for consortia spanning animal, human, and clinical studies is becoming increasingly challenging. Here, we present an RDM strategy implemented for the Heidelberg Collaborative Research Consortium. Our consortium combines basic and clinical research in diverse populations (animals and humans) and produces highly heterogeneous and multimodal research data (e.g., neurophysiology, neuroimaging, genetics, behavior). We present a concrete strategy for initiating early-stage RDM and FAIR data generation for large-scale collaborative research consortia, with a focus on sustainable solutions that incentivize incremental RDM while respecting research-specific requirements.


Introduction
Extensive efforts have recently been made to promote reproducibility, replicability, and transparency in scientific research. The evolution of open-access publishing [1], open-source data repositories [2], and open-source software applications has transformed the work of researchers in various fields. As research has grown more sophisticated, these developments have become indispensable, and large-scale multidisciplinary projects have emerged to promote collaborative work.
Research in neuroscience increasingly encompasses a variety of fields, including biophysics, molecular biology, medicine, cognitive neuroscience, psychology, and ethology. Neuroscience datasets are constantly growing as a result of scientific advances in acquisition systems that produce large-scale multimodal datasets [3-7]. Moreover, research institutions are increasingly involved in interdisciplinary collaborative research. Such collaborative developments pose new challenges for research data management (RDM) [8], specifically in terms of data harmonization, use of computational resources, and data sharing [9]. Integration of neuroscientific datasets and data sharing are among the greatest obstacles in large-scale consortia that combine multimodal and multisite studies. These challenges become even more pronounced if they are not addressed from the beginning, and they can have a direct impact on research collaborations and the publishing process [10].
The primary goal of this report is to describe our approach to implementing a data management strategy across a collaborative research consortium (Heidelberg-based collaborative research center [CRC] 1158). The consortium comprises independent, multidisciplinary research groups pursuing common, goal-oriented research. Additionally, we aim to provide recommendations and guidelines for best practices. We specifically describe our ongoing RDM efforts, which are divided into 2 sections: (i) the RDM planning phase: evaluating common data management challenges, RDM challenges in specific projects, and CRC researchers' data management requirements, and (ii) the RDM implementation phase: implementing common RDM procedures across consortium projects, resource allocation decisions, and key resources. This report discusses our experience in developing and implementing a data management strategy and offers concrete solutions to promote multidisciplinary collaborative research and open science objectives.

Consortium-wide RDM planning phase
The RDM planning phase is critical to ensuring effective and efficient RDM. During this phase, we assess common and project-specific data challenges, review the scope and objectives of the consortium and its research projects, collect information on data acquisition, implement data management policies and procedures, and create a comprehensive data management plan. This plan must take into account the type of data being collected, stored, and secured; the analysis methods; and any legal or ethical considerations, including those related to sensitive data. We also review the challenges that have emerged from our consortium's efforts to create coherent RDM planning for diverse datasets, develop common infrastructure, document data and metadata, and establish procedures for sharing, archiving, and handling sensitive data. This phase provides the foundation for successful and sustainable data management and for compliance with laws and regulations.

Common data management challenges across projects
Challenges due to the diversity in data types
While developing an RDM strategy for our CRC, a central challenge was the diversity of data types produced by the multidisciplinary approaches utilized in different projects. The projects involve balancing data from basic and clinical research as well as data from animals and humans. Diverse signals are collected at various spatial and temporal scales, such as single-cell and network data, genetic data (e.g., genome-wide association studies, gene expression profiles, epigenetic modifications), imaging (e.g., magnetic resonance imaging [MRI], positron emission tomography [PET]), intra- and extracellular electrophysiology, calcium imaging using fluorescence-based microscopy, confocal light sheet microscopy, behavioral data (e.g., task performance), and clinical data (e.g., patient surveys, medication, cognitive assessments, and psychological questionnaires). These diverse techniques and collected data types raise multiple data manageability issues within and between projects, with broad implications for data interoperability and reuse. Some of these manageability issues include inconsistent data formats, limited harmonization of heterogeneous datasets, integration of multimodal datasets, difficulties in achieving and maintaining good data quality (checking for missing data and duplicates), and adhering to privacy and security regulations while enabling access to specific users.
All human research projects in our consortium use a multimodal approach that collects and combines data from 2 or more of the following methods: MRI, including structural (anatomical, diffusion-weighted imaging) and functional (task-based, resting-state); electroencephalography (EEG); magnetoencephalography (MEG); behavioral assessments; psychometric evaluations; and genetics. The use of multiple data sources allows researchers to identify relationships between different modalities and to gain a more comprehensive and accurate understanding of the neural processes underlying their research topic. To improve reliability and gain further insights, datasets from multiple sources at different time points can be integrated. However, there are significant problems associated with acquiring and integrating multimodal datasets [11].
The cost of acquiring data from multiple sources can be prohibitively expensive, and the process of collecting and combining the data can be time-consuming because it frequently necessitates the use of specialized equipment and software [12]. For instance, combining functional MRI (fMRI) and EEG data [13] into a unified dataset requires knowledge of both data types and the use of special software to analyze them. Interpreting the results of multimodal data can be challenging due to the complexity of the data and the potential for bias, which can lead to overfitted models and incorrect conclusions. The complexity of the data, as well as the need for specialized software, can create a bottleneck in data processing and analysis.
Additionally, it can be difficult to accurately integrate data from different sources due to differences in formats, storage locations, scales, and resolutions (spatial and temporal), such as those arising between fMRI, EEG, and behavioral recordings. There may be discrepancies between the data collected from different sources, making it difficult to integrate the data into a comprehensive benchmark dataset.
The datasets and metadata may not be structured in a consistent way that allows for integration with other datasets or the use of more sophisticated data analysis techniques. Labeling data points is often difficult and time-consuming, making it challenging to develop accurate models. As different complex measurements become routine parts of data collection, this problem will only increase.
Moreover, data are often acquired at multiple time points, either for longitudinal assessments or due to time constraints among participants or to prevent volunteer fatigue [14]. Datasets gathered over several days are typically randomized or pseudorandomized. This is particularly the case in some human projects where longitudinal studies are conducted [15]. Studies that involve repeated assessments typically span a long period of time (e.g., 10 years). In such studies, it is common that data are collected by different researchers with different types of software, and new experiments may be added or removed. It is therefore essential to provide sufficient metadata as a set of documents available to download alongside the data themselves. This would support data reuse and enable accurate analysis and interpretation.
Although many advances have been made regarding the organization, annotation, and description of research datasets, there is still much work to be done to ensure that datasets are fully standardized and can be accurately shared and reused [16]. For example, data standards such as the Brain Imaging Data Structure (BIDS) [17] exist for neuroimaging data [18], EEG-BIDS for electroencephalography data [19], and MEG-BIDS for magnetoencephalography data [20], while standards for other data modalities (e.g., sensory testing, ecological momentary assessments [EMAs]) are not yet available. Integrating behavioral data is particularly challenging, as there is a lack of clear standards and ontologies that allow for generalization and thus grouping of different behavioral paradigms (see the section on behavioral data standardization). Finding data that can potentially be pooled remains challenging, let alone ensuring that datasets are in standardized formats for meta-analyses by third parties. Strategies for applying the Findable, Accessible, Interoperable, and Reusable (FAIR) [21] principles are still under development, and standard annotation systems and clear data identifiers are crucially needed.
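To make such standards concrete, the following minimal sketch (Python standard library only; the subject ID and sidecar values are hypothetical) lays out a BIDS-style directory with a JSON sidecar for an anatomical scan:

```python
import json
from pathlib import Path

# Hypothetical dataset root and subject ID, for illustration only.
root = Path("my_bids_dataset")
anat = root / "sub-01" / "anat"
anat.mkdir(parents=True, exist_ok=True)

# Every BIDS dataset carries a dataset_description.json at its root.
(root / "dataset_description.json").write_text(json.dumps({
    "Name": "Example multimodal study",
    "BIDSVersion": "1.8.0",
}, indent=2))

# Imaging files are paired with JSON "sidecars" holding acquisition metadata.
sidecar = {
    "MagneticFieldStrength": 3,      # tesla; illustrative value
    "Manufacturer": "ExampleVendor", # placeholder
    "RepetitionTime": 2.3,           # seconds; illustrative value
}
(anat / "sub-01_T1w.json").write_text(json.dumps(sidecar, indent=2))
# The actual image would sit alongside the sidecar as sub-01_T1w.nii.gz.
```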

Challenges due to diversity in acquisition, preprocessing, and analysis approaches
The availability of robust neuroscience resources such as high-performance computing (HPC) clusters [22-24], modern workflow technologies (e.g., Galaxy [25], Snakemake [26]), cloud-enabled storage and computing infrastructures (e.g., Amazon AWS, Google Cloud [27]), secure databases [28], repositories [29], and analysis platforms is fundamentally changing how research in neuroscience is communicated and linked to existing raw data and findings [30]. Such tools allow researchers to utilize diverse techniques and produce massive amounts of high-dimensional data (large sample sizes, various models and conditions), providing greater statistical power and the opportunity to perform robust secondary data analysis [31]. However, the data-driven neuroscience approach faces several technical issues that need to be resolved before its full potential can be realized.
While the majority of collaborative research consortia collect diverse multidimensional datasets, one of the primary challenges is that most of these datasets are typically inadequate for modern research methods and infrastructure [32]. Before committing to any tools for processing and analysis, it is important to understand the type, format, size, and complexity of the collected data.
Neuroscience experiments often result in incompatible datasets that cannot be compared and pooled across different research groups due to the use of custom methods for organizing and describing data. The data formats used in each project may also vary, leading to data and metadata being stored in different locations. Even if datasets are imported into a common file format, researchers' choices for data preprocessing and analysis may not be compatible between laboratories or even between different projects in the same laboratory. This is further complicated by the use of various resources, such as custom preprocessing workflows and software, which can vary widely.
Custom preprocessing pipelines and analysis scripts are another significant challenge. These pipelines are tailored to meet the specific needs of a project or research group and may use a combination of open-source software and proprietary tools (e.g., Python scripts for examining oscillatory frequencies associated with experimental pain, followed by proprietary software for statistical analysis or visualization). As a result, researchers may need to write custom analysis scripts or convert datasets into compatible formats to use publicly available analysis tools. The use of different preprocessing or analysis software can also result in different file input and output formats, making it challenging to compare and pool datasets. Furthermore, lab-specific workflows and pipelines usually prioritize internal needs over the needs of a broader community, which can limit the reproducibility of research outcomes. Additionally, lab-customized software or hardware solutions may not perform efficiently on large or complex datasets. Using third-party tools and software in workflows may lead to broken dependencies and issues with reproducibility [33]. This issue is exacerbated by the fact that original analyses were done in different software environments, operating systems (e.g., Linux, Macintosh), and software versions.
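One low-cost mitigation for environment drift, sketched here under the assumption of a Python-based pipeline (the package list is illustrative), is to record the exact software environment alongside every derived output so that results can later be traced to the versions that produced them:

```python
import json
import platform
import sys
from importlib import metadata

def snapshot_environment(packages):
    """Record interpreter, OS, and package versions for provenance."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {pkg: metadata.version(pkg) for pkg in packages},
    }

# Illustrative package list; a real pipeline would list its own dependencies.
env = snapshot_environment(["numpy", "scipy"])
with open("analysis_environment.json", "w") as f:
    json.dump(env, f, indent=2)
```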
One of the most pressing challenges encountered in our consortium projects is the lack of standardized preprocessing and analysis approaches [34]. Deep learning neuroimage analysis tools require significant computing power, memory, and storage, and HPC clusters can provide these resources, dramatically accelerating the analysis process [35, 36]. However, nonexperts may find it challenging to access these resources and perform scientific computing [22]. Especially for experimentalists, there is a fundamental need for succinct documentation on how to use these resources efficiently. To address these challenges, applications for image data processing must offer application programming interfaces (APIs) and a user-friendly graphical user interface (GUI) that can be used without specialized coding knowledge. It is often recommended to use comparative analysis methods and multiple software packages to obtain reliable and reproducible research results. Developing such tools (e.g., bwVisu [37]) requires significant customization and software development costs, which may not be feasible for individual research labs. To overcome these obstacles, experimentalists, data managers, and computer scientists must work in a close, strategic partnership.
Reproducibility and variability in published results have been a topic of investigation [38, 39], and research has shown that there is no single "best" way to process and analyze large-scale single or multimodal datasets. For instance, a neuroimaging study presented the results of a survey of fMRI experiments that revealed substantial differences in how individual labs preprocess and analyze their data: 70 independent laboratories analyzed the same dataset and produced varying results [40]. Another study supported these findings and showed that analytical decisions made by individual researchers can significantly impact the findings from an fMRI dataset [41]. Analyzing fMRI data with software packages such as SPM (Statistical Parametric Mapping) [42] or the FMRIB Software Library (FSL) [43] can also lead to different outcomes. These findings emphasize the potential implications of the absence of standardized pipelines for handling complex data and how this can impact research outcomes. Efforts are being made to determine sources of variability and to develop homogeneous and standardized computing environments [44].

Metadata challenges
Another challenge is managing metadata [45, 46], especially for complex, large, multisite, heterogeneous datasets [47]. In an ideal scenario, all metadata related to acquired datasets would be readily accessible and sufficient for data sharing. In reality, they are not (yet). The associated metadata (such as the origin and type of a sample, experimental conditions, applied measurement techniques, devices used, calibration methods, and units) are frequently missing, incomplete, or only available in fragmented form. Additionally, datasets often lack critical details such as the accuracy and variability of data points, as well as the underlying data structure. For instance, currently available datasets may not possess the resolution, annotation, or labeling required for deep learning algorithms to be applied. Even if metadata are available, extracting meaningful insights from the data may require additional tools. Knowledge of dataset quality and accompanying metadata is increasingly crucial for ensuring reproducibility [48, 49].
For most neuroimaging datasets, data annotation is essential. For example, when analyzing task-based data, the extent to which events are clearly documented determines an experiment's reproducibility. It is indeed important that metadata be informative about the dataset to be analyzed while following standardized ethical and quality measures. For instance, some projects in the consortium examine pain chronicity by monitoring pain patients over a period of days or years and collect various data (such as MRI and EEG) and metadata (such as pain ratings, response times, or error rates) at multiple time points. Associations between data and metadata are made to establish relationships between, for example, neural alterations and pain variables in patients with chronic pain [50] or changes in pain chronicity and associated neural networks over time [51]. Such studies could not be performed without sufficient and reliable documentation of metadata.
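As one illustration of machine-readable event documentation, the sketch below (Python standard library; the task name, conditions, and timings are hypothetical) writes a BIDS-style events table with the required onset and duration columns:

```python
import csv

# Hypothetical trial timings for a task-based recording; BIDS events files
# require "onset" and "duration" (in seconds); extra columns are free-form.
events = [
    {"onset": 0.0,  "duration": 1.5, "trial_type": "painful_stimulus", "response_time": 0.62},
    {"onset": 12.0, "duration": 1.5, "trial_type": "neutral_stimulus", "response_time": 0.48},
]

with open("sub-01_task-pain_events.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=events[0].keys(), delimiter="\t")
    writer.writeheader()
    writer.writerows(events)
```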
Moreover, a laboratory can generate a large dataset from a single experiment or a single dataset from multiple experiments. The collected metadata can be very complex and stored in multiple files with different formats, which can only be read by the acquisition software or by customized code written for internal use. In such cases, a consolidated strategy is necessary to unify the data into a single format that can be read by various software applications and analyzed in an efficient and reproducible manner. Furthermore, depending on individual lab practices, raw data and associated metadata may be distributed across different files or separate directories. It requires additional effort to read and extract the metadata from their original raw data files and integrate them into a single file. Interoperability between file formats can be a technical issue if the appropriate software to read, view, and process the files is no longer available. It is also possible that the format is no longer supported by any software, making it impossible to open the file.
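A minimal sketch of such consolidation, assuming the per-file metadata have already been exported to JSON (directory and file names are hypothetical), merges scattered fragments into a single session-level record:

```python
import json
from pathlib import Path

def consolidate_metadata(fragment_dir, output_file):
    """Merge per-file JSON metadata fragments into one session record."""
    record = {}
    for fragment in sorted(Path(fragment_dir).glob("*.json")):
        # Keep provenance by keying each fragment on its source file name.
        record[fragment.stem] = json.loads(fragment.read_text())
    Path(output_file).write_text(json.dumps(record, indent=2))
    return record

# Hypothetical layout: one directory of exported fragments per session.
consolidate_metadata("session_01/metadata_fragments", "session_01/metadata.json")
```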

Data storage and volume
The data volume varies substantially depending on the data modality, ranging from a few megabytes (e.g., questionnaire data) to terabytes (e.g., high-resolution fluorescence imaging). Projects involving large amounts of data generated from high-resolution fluorescence imaging, volume electron microscopy, electrophysiology, or MRI can typically yield terabytes of data. Such data are often stored in dispersed locations and infrastructures in various formats (often proprietary), requiring a significant amount of time and effort to manage, utilize, and curate the data efficiently [52]. Researchers, particularly those working with high-dimensional data, require consistent support for data storage, timely backups, and archival systems. Inefficient data storage processes can lead to data integrity failures, accessibility issues, and increased operational costs.
Commercial cloud storage solutions are available, offering a wide range of general-purpose data backup and restoration services. However, their adoption may be challenging and limited for individual research labs or universities due to differences in data types, volumes, privacy regulations, and budgets [53]. Cloud solutions can be expensive, especially when large amounts of data or complex computing workloads need to be stored. Besides, many cloud providers may lack the necessary APIs, scripts, and tools to facilitate data migration onto analysis platforms and may not have sufficient data protection support for sensitive data (e.g., clinical data) [54].
Handling protected clinical data requires additional layers of security, privacy, and regulatory compliance. Universities may deal with sensitive data, such as research data or intellectual property, subject to strict regulations such as the European Union's General Data Protection Regulation (GDPR) [55] and the Health Insurance Portability and Accountability Act (HIPAA). Cloud solutions require a reliable and fast Internet connection, and universities may have limited bandwidth that affects research activities. Besides, it can be difficult to migrate to a different provider or back to on-premises infrastructure, resulting in long-term dependency on a single vendor and the associated risks. Despite these challenges, cloud solutions can offer significant benefits, but research groups should evaluate their data protection needs, choose compliant cloud storage and backup services, and take steps to ensure that data are stored securely and in compliance with any relevant regulations or institutional policies. To meet these requirements, cloud providers have developed dedicated platforms and specialized backup and storage solutions designed specifically for health care organizations. However, these services cannot be easily implemented or adopted by individual labs or consortia.
Ensuring access to secure and optimal storage solutions that can be integrated with workflows encompassing data acquisition, intermediate analysis, and archiving is thus a major challenge.

Challenges in data documentation
Data documentation presents a number of challenges, including the adoption of digital systems and laboratory inventory management systems for large consortia. Electronic laboratory notebooks are essential for data documentation (such as hypotheses, methods, observations, experimental protocols, and notes). Many efforts over the past years have recognized the critical need for institution-wide adoption and implementation of an electronic laboratory notebook (ELN) [56].
For a large-scale neuroscience consortium spanning diverse experimental protocols, it is important to select an ELN that can provide comprehensive support for a wide range of experimental protocols and the flexibility to add domain-specific features if required [57]. The initial challenge is to select an appropriate option that fits current laboratory standards. In addition, a usable and sustainable ELN needs to be interoperable and incorporated into existing data workflows. There are obvious issues of user resistance, the high costs involved in implementation, and secure configuration and maintenance, and the user will ultimately be responsible for managing the digital system. There are various open-source and proprietary options available for use [58], but it is important to note that for proprietary options, documentation may exist in the form of vendor specifications, or it may be created and maintained within a global community. However, these options may not fulfill domain-specific requirements. In some cases, a key functionality that could support easy documentation is absent, and the available features may not be beneficial to users. Additionally, there may not be an automated end-to-end solution that enables users to document experiments, which can make the process time-consuming and tedious when performed manually.
When choosing an ELN, it is crucial to consider potential legal and data privacy concerns [58]. ELNs are digital resources that allow multiple users to access confidential information. It is therefore essential to select an ELN that is designed in compliance with applicable laws, regulations, and ethical standards. It is also important to consider security measures such as encryption and user authentication to maintain the confidentiality and security of stored data. Lastly, reviewing the terms of service is critical to understand how data are used, stored, and shared, as well as any restrictions on use. By taking these factors into account, organizations can ensure the security and compliance of their data. Fortunately, resources such as the ELN Matrix created by the Harvard Biomedical Data Management Group and the ELN Finder, which provide information on various software options, can be incredibly useful in this process [59, 60].

Data sharing and dissemination challenges
There are significant challenges in organizing datasets in a useful manner to enable sharing with collaborators. Even if a dedicated central data storage infrastructure is available, insufficient quality control measures as well as time constraints have a direct impact on data-sharing practices. Especially in small research groups or individual projects, limited funding and sustainable resources directly impact the level of data sharing and reuse. Another significant issue is motivating researchers to share data publicly. Researchers are often hesitant to openly share data due to concerns about not receiving credit, reducing their own chances of performing secondary studies, mishandling data sensitivity, and facing criticism about data quality. Although an increasing number of research organizations, academic journals, and large-scale projects support extra efforts to build realistic data-sharing techniques, such procedures have not yet become standard research practice [61].
Furthermore, while many journals require open data sharing and dataset submission to public repositories prior to manuscript submission, there is limited oversight of data-sharing policies. Additionally, choosing a suitable public repository can be difficult for a number of reasons. Researchers should confirm that the repository complies with the research data regulations of their host institution before contributing datasets to open repositories. Finding a suitable subject-specific repository for a given dataset can be challenging. An alternative is to submit data to a general-purpose repository, but there can be issues regarding data visibility, as such repositories might not be well recognized within a particular field of research.
Submitting data to general repositories can pose significant issues, including inadequate support for certain types and formats of data [62, 63]. For example, if the datasets are in a nonstandard format, the repository may not be able to process them correctly or even accept them. Also, general repositories may lack specialized tools or services for converting, organizing, or analyzing highly specific data such as medical records or geospatial data. Without the necessary support, researchers may not be able to make full use of the data or even access them. General repositories may also lack the level of curation and organization found in specialized repositories, leading to difficulty in evaluating the quality and relevance of the data, thereby reducing reproducibility and hindering the building on previous research. Therefore, it is recommended that researchers submit their data to specialized repositories that are tailored to their field of research. For example, neuroscience-specific repositories such as OpenNeuro are designed to accommodate the unique needs of neuroscience data, are managed by experts in the field, and provide the necessary infrastructure to ensure the safe and secure storage of data. Furthermore, they often offer additional services, including data analysis, curation, and visualization tools, allowing researchers to better understand and use the data.
Even after identifying a suitable repository, bureaucratic procedures and the demands of publishing datasets in open data repositories require additional work, including converting files to the required format, compiling consent forms and contracts, removing sensitive information, and preparing documentation. Finally, maintenance funding must be taken into account because many repositories charge a fee based on data volume. In the latter stages of the data life cycle, these factors can hamper discoverability and reusability.

Challenges due to sensitive data
Projects involving human subject data or other sensitive data must adhere to strict data privacy regulations for the storage, use, and sharing of research data [64]. Sensitive data containing potentially identifying information must be anonymized or pseudonymized prior to making the data public to protect participant confidentiality. Maintaining such high ethical standards can be costly and time-consuming, adding further burden to researchers [65]. Long-term preservation and sharing of sensitive data largely depend on informed consent, data reuse agreements and policies, and the type of archiving solution or data repository used. Each step of handling sensitive data must protect privacy and identity protection rights, often through deidentification or anonymization. There are distinct sets of regulations for full anonymization versus deidentification of data [65]. Therefore, it is recommended to retain multiple versions of the data: one suitable for public release and one suitable for further research but available on a highly restricted basis [66]. These considerations can lead to increased data duplication and storage needs.
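One common building block is keyed pseudonymization: participant identifiers are replaced by keyed hashes, and the key is held under restricted access so the mapping cannot be reversed from the shared data alone. The following is a minimal sketch (Python standard library; the key and identifier format are hypothetical):

```python
import hashlib
import hmac

# The key must live in restricted storage (never alongside the shared data);
# this value is a placeholder for illustration.
SECRET_KEY = b"replace-with-a-key-held-by-the-data-controller"

def pseudonymize(participant_id: str) -> str:
    """Derive a stable, non-reversible pseudonym from a participant ID."""
    digest = hmac.new(SECRET_KEY, participant_id.encode(), hashlib.sha256)
    return "sub-" + digest.hexdigest()[:12]

# The same input always maps to the same pseudonym, so records can be
# linked across sessions without exposing the original identifier.
print(pseudonymize("clinic-patient-0042"))
```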
The sharing of sensitive data between collaborators located in different locations requires additional effort, as data controllers need to ensure that data protection requirements are met both in the original location where the data were collected and at the collaborator's location [67]. Furthermore, external collaborations across universities can present logistical challenges in the form of access and security entitlements. These concerns are compounded when the collection of sensitive data is part of the research project or in collaborations with researchers embedded in clinical settings. Another major obstacle to sharing confidential data with external parties is the cost involved in adopting secure data-sharing platforms, as well as the major risk of participants being identified.
In this context, researchers require consistent training and education that promote responsible research conduct and adherence to institutional and discipline-specific data management policies, including risks of data disclosure, confidentiality obligations, privacy principles, and network security.

Challenges in behavioral experiments
The increasing number of collaborative studies may be hampered by challenges in standardizing behavioral experiments across laboratories (e.g., continuous animal movement recordings, mouse trajectories). There are no specific community data standards for storing behavioral datasets, which has a direct impact on data sharing. Also, research labs may not have access to modern tools for extracting and analyzing behavior because their implementation may require advanced computational skills.
Another challenge posed by behavioral data is the reproducibility of experimental results, because it is often difficult to replicate the exact conditions under which an experiment was conducted across laboratories or even within the same laboratory [68]. It is indeed difficult to standardize metadata across behavioral experiments due to various factors that are difficult to control (confounding variables), such as the laboratory environment (e.g., time of testing during light or dark phases, housing system for rodent experiments, auditory sounds) and experimenter bias, which can lead to inconsistencies in data collection.
The lack of publicly available behavioral datasets with accurate annotations is a major impediment to benchmarking the algorithms used in behavioral analysis [69]. These algorithms range from simple statistical tests to more complex machine learning models that classify, cluster, or extract features from behavioral data [70, 71]. However, due to the complexity of behavior patterns, consistent labeling of data is challenging, and the time and resources required for collecting and labeling datasets make it expensive for many labs to obtain high-quality datasets for testing and comparing algorithms.
To address these issues, it is essential to develop a centralized database for storing methods and experimental protocols of behavioral assays; parameters (e.g., sex, age, strain of the animal, genotype, marking, testing conditions); data and metadata files generated in the task (such as behavioral responses and compressed video and audio files); and a common framework that supports further analysis and visualization [72].
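A minimal sketch of such a centralized record, assuming a relational store (SQLite here; the schema and assay vocabulary are illustrative, not a community standard), might register one row per behavioral session together with pointers to its data files:

```python
import sqlite3

# Hypothetical schema for registering behavioral sessions.
conn = sqlite3.connect("behavior_catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sessions (
        session_id   TEXT PRIMARY KEY,
        assay        TEXT NOT NULL,     -- e.g., open field, fear conditioning
        animal_id    TEXT NOT NULL,
        sex          TEXT,
        age_days     INTEGER,
        strain       TEXT,
        genotype     TEXT,
        protocol_ref TEXT,              -- link to the written protocol
        data_path    TEXT               -- location of response/video files
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO sessions VALUES (?,?,?,?,?,?,?,?,?)",
    ("sess-0001", "open_field", "mouse-17", "F", 90,
     "C57BL/6J", "wild-type", "protocols/open_field_v2.pdf",
     "/data/behavior/sess-0001/"),
)
conn.commit()
conn.close()
```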

RDM challenges for specific projects
Across the consortium, several projects presented specific challenges in RDM, whether in data organization and governance, sheer data volume, logistics, or collaborative data sharing. In these cases, effective data management is integral to project success and may require customized strategies and resources. Below, we list examples representing the consortium's extreme cases.

Electrophysiology with high-density probes
A few of the animal projects in the consortium make use of technologies such as high-density Neuropixels probes [73]. Neuropixels datasets are often large (∼80 GB/hour) and computationally demanding, which can make it difficult to scale spike-sorting workflows across different labs and datasets. Data storage requirements increase due to the significant amounts of derived data needed for intermediate processing (such as filtering and spike sorting) as well as stimulation and/or behavioral parameters (such as optogenetic stimulation, motion or whisker tracking, and task performance). Analysis and postprocessing may often require computationally intensive algorithms and hardware acceleration to handle data that cannot be loaded into local memory [74].
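For recordings too large for local memory, one widely used pattern is to memory-map the raw binary file and process it in fixed-size chunks. The sketch below assumes a NumPy environment; the file name, channel count, and sampling rate are illustrative stand-ins for values taken from the acquisition metadata:

```python
import numpy as np

# Illustrative recording parameters; real values come from the acquisition
# system's metadata.
N_CHANNELS = 384
DTYPE = np.int16
CHUNK_SAMPLES = 30_000  # process ~1 s at a 30 kHz sampling rate

# Memory-map the interleaved binary file instead of loading it into RAM.
raw = np.memmap("recording.bin", dtype=DTYPE, mode="r")
data = raw.reshape(-1, N_CHANNELS)

# Example chunked pass: per-channel RMS, computed without ever holding
# the full recording in memory.
sum_sq, n = np.zeros(N_CHANNELS), 0
for start in range(0, data.shape[0], CHUNK_SAMPLES):
    chunk = data[start:start + CHUNK_SAMPLES].astype(np.float64)
    sum_sq += (chunk ** 2).sum(axis=0)
    n += chunk.shape[0]
rms = np.sqrt(sum_sq / n)
print(rms[:5])
```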
Differences in recording conditions, spike-sorting algorithms, and data preprocessing can lead to significant outcome variability. The real-time processing requirements of closed-loop experiments only exacerbate these issues. Important parameters initially recorded with the raw data (e.g., animal arousal/anesthesia level, impedance measurements) might be excluded or lost in the derived datasets used for analysis, which can affect the accuracy of the results. Complex hierarchies of derived data and multimodal datasets (e.g., accelerometer, whisker or pupil tracking) collected with different instruments compound these issues. It can be challenging to validate and reproduce results because the use of various algorithms and parameters can produce different outcomes. Therefore, it is essential to carefully consider and account for all relevant parameters and sources of variability in the analysis of complex datasets.

Large-scale in vivo 2-photon calcium imaging
Some rodent projects within the consortium acquire large amounts of data, collected over months [75]. For example, imaging techniques such as fluorescence imaging or 2-photon calcium imaging generate large volumes of spatiotemporal imaging data (up to 100 GB/hour), which require rigorous preprocessing steps (image segmentation, denoising, motion correction, manipulation and handling of large video files, and neural activity deconvolution) using high-throughput computing [76]. The downstream processing and analysis of the resulting datasets generated over the course of months is often challenging and requires complex workflows [77]. A few open-source software solutions such as CaImAn [78] and EZcalcium [79] have been proposed to deal with these challenges. However, comparative analysis studies have revealed that the neural assemblies (collections of neurons that are activated simultaneously in response to a particular stimulus) recovered from 2-photon imaging datasets can vary significantly depending on the algorithms used. Some algorithms have high precision but slow runtimes, while others have faster runtimes but lower accuracy [80]. Another issue is that many studies include synthetic or benchmarking datasets, but the production and analysis of these datasets require challenging calculations, raising the computational complexity and costs. This highlights the need for more scalable and fully automated workflows that can be run on HPC clusters [75] in order to ensure reliability and performance [20]. Existing software solutions can be used for the analysis and visualization of datasets, but any adopted data and metadata standards must be interoperable with these tools.
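To give a flavor of one routine step in such pipelines, the sketch below computes a ΔF/F trace from extracted fluorescence signals using a simple percentile baseline; the percentile choice and input shape are illustrative assumptions, not the method of any particular toolbox:

```python
import numpy as np

def delta_f_over_f(fluorescence, baseline_percentile=10):
    """Compute ΔF/F per cell from a (n_cells, n_frames) fluorescence array."""
    # A low percentile of each trace approximates the resting baseline F0.
    f0 = np.percentile(fluorescence, baseline_percentile, axis=1, keepdims=True)
    return (fluorescence - f0) / f0

# Synthetic example: 3 cells, 1,000 frames of noisy baseline activity.
rng = np.random.default_rng(0)
traces = 100 + 5 * rng.standard_normal((3, 1000))
dff = delta_f_over_f(traces)
print(dff.shape)  # (3, 1000)
```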

Challenges associated with human-animal tandem projects
Projects collecting data from both human and animal models pose several challenges, such as the systematic and parallel implementation of experimental designs, techniques, and analysis tools [81, 82]. Data management processes to create harmonized datasets and analysis workflows while establishing clear linkages between human and animal models are difficult, and standards for integrating data are somewhat ad hoc. In collaborative projects involving multiple laboratories working on multiple species, the integration of data and analysis should happen systematically, not only sporadically. Apart from the sheer scale of such collaborations involving multiple research areas and the multimodal RDM issues discussed above, these tandem projects require a secure platform for data transfer between different sites (e.g., laboratories and clinics) with different security permissions and data-handling standards [83].

Consortium-wide RDM implementation phase
The implementation of an RDM strategy for a large consortium is primarily based on the various types of data generated across research projects, as well as on practical methods for organizing and managing the data. Our priority was to adequately characterize the consortium's needs before committing to any specific resources. One of the primary goals of identifying common requirements was to ensure that the best data management practices could be implemented across the consortium while taking individual lab practices into account.
Direct involvement with researchers during the planning and initial implementation phases was crucial to identifying the most helpful RDM measures. These measures may be as simple as coordinating communication between core IT staff and researchers and facilitating access to institutional or other preexisting resources. We devoted significant time to initially gathering information about publicly available tools and services that would be useful to the diverse projects within the consortium. Given the large number of laboratories from various institutions participating in the consortium and the increasing number of requirement changes over the course of a project, information was gathered in a variety of ways (e.g., virtual individual interviews with project principal investigators [PIs], discussions during online data seminars led by the CRC data manager, and personal meetings with experimentalists and PhD students). Data discussions and regular communication with consortium members have greatly aided our assessment approach.
In addition, we developed an RDM assessment questionnaire in order to tailor the data management solutions to the common needs of the projects. The PIs or project responsible persons were required to respond to a variety of data management questions about the major challenges they faced when managing data in their labs (see Supplementary Information 2).
The RDM assessment questionnaire included questions about the types of experimental models, acquisition methods and techniques, types of analysis tools and software, data modalities, raw and intermediate file formats, workflows for data preprocessing and analysis, procedures for sharing and publishing datasets, and so on. Additionally, we discussed challenges in publishing data and metadata in open data repositories.
The outcomes of the RDM assessment questionnaire were then used to implement common RDM solutions for CRC projects, such as identifying and targeting data storage, organizing data, and sharing resources. In terms of research data and technological advancements, the survey responses were extremely diverse. The most common challenges reported by the consortium's researchers were sharing large-scale datasets with collaborators and efficiently curating data in archives and repositories after a project was finished.

Assessment of data management requirements in the Heidelberg Pain Consortium
To identify common RDM measures, we first examined the commonalities between all projects, such as the type of population studied (rodents, humans, or tandem), followed by the type of data modalities acquired (e.g., neurophysiology, neuroimaging, and behavior). Human projects include data collected from both healthy individuals and patients with various clinical conditions (e.g., chronic back pain, severe depression, diabetic neuropathy), and rodent projects utilize mice as animal models (Fig. 1A).
We further categorized projects into subgroups based on common data modalities that were being acquired. Fig. 1B depicts an overview of the various data types collected across the consortium projects. Neurophysiology data (including electrophysiology and cellular physiology) are the most frequent data category collected across all studies (i.e., 83% for animal and tandem projects and 100% for human projects). Imaging, behavioral, and genetic data are collected in similar proportions in human projects (75%), whereas psychometric data are collected by all human projects. Imaging data are collected in 41.2% of the animal projects and 83% of the tandem projects. Behavioral data (e.g., various pain models) are collected in 58% of the animal projects and 66.67% of the tandem projects.
Our consortium includes research projects that typically collect data from a wide range of methods and techniques, such as electrophysiology, neuroimaging, extracellular and intracellular signals, and 2-photon imaging for rodents; behavior, including stress and fear assessment in humans and rodents; and multiomics datasets (Fig. 1B). Rodent and human projects use comparable methods, such as MRI of both the brain and peripheral nerves; electrophysiology, including EEG and MEG for humans; extra- and intracellular signals and 2-photon imaging for rodents; and brain stimulation methods, including transcranial magnetic stimulation and transcranial electrical stimulation. Additional methods specific to humans are peripheral physiology (e.g., heart rate, blood pressure, sensory profiles), virtual reality, psychometrics, and daily assessments of psychological measures such as EMA. Methods specific to rodents include optogenetics.
We gathered information on optimal storage solutions for projects, and the responses were diverse, as some projects acquired large numbers of datasets (e.g., ranging from 5 TB/day up to 1 petabyte), whereas other projects acquired relatively small datasets (e.g., a few gigabytes per month). For instance, some projects involve the continuous recording of neurophysiology datasets from high-density probes for a few days or a week at a time, which can generate up to 100 TB of data. We documented that 80% of the projects were already utilizing university infrastructure for data storage, whereas many human projects utilized individual lab servers.
We then collected information about the file formats used for collecting and preprocessing raw data from different acquisition systems and a wide range of methods. Given the complexity and diversity of experiments and the different volumes of data collected in the consortium, the acquired file formats are most often highly specific to certain data types (such as time series, e.g., voltage traces; image stacks; stimuli; or behavior) or to particular acquisition and recording devices. Several projects require the development of new tools and software for migration to open data standards, resulting in the need for additional resources and support from the CRC.
Electrophysiology experiments, equipment, and analysis pipelines in particular are customized for each project and generate data in a variety of file formats. Data are collected using various techniques and experimental designs, ranging from patch clamp to tetrode and high-density silicon probe recordings in freely moving animals. The preprocessing steps for intracellular, juxtacellular, and extracellular techniques are frequently customized. Any standardization measures must be compatible with existing lab analysis tools and data-processing methods. Although a number of community-developed electrophysiology metadata and data standards are available and evolving, they have not yet been widely adopted.
We also collected information about the most common preprocessing and analysis software (e.g., IgorPro [84], ImageJ [85], MATLAB) used across different projects. Our assessment also covered which projects use electronic lab notebooks and which use traditional handwritten notebooks. Additionally, we assessed the definition of user permissions for data access, protocols for data sharing, short- and long-term storage needs, and implementation costs.

RDM communication and exchange
Several neuroscience-specific RDM solutions already exist, including software and infrastructures for streamlining data collection and acquisition protocols, collaborative data analysis and visualization packages, and data sharing and archiving platforms [66, 86-88]. Our initial observation while implementing RDM strategies was that many researchers were not aware of the benefits of existing resources, partly due to uncertainties regarding bureaucratic procedures, the GDPR, and, more often, the technical requirements for easy integration of these resources into existing laboratory practices [89, 90]. Therefore, we put great emphasis on promoting and encouraging the use of preexisting resources that meet the needs of our consortium or that help in a particular use case. An important aspect was to find a balanced approach that encourages an appropriate degree of integration of existing resources with realistic domain specificity. We curated a list of both generic and neuroscience-specific RDM resources, at both the consortium/institutional (internal) and national and international (external) levels. The list can be accessed in the data management section of our CRC website [91].

Data infrastructure (platforms for storage, organization, analysis, and sharing of data)
A technical infrastructure was made available to consortium members based on individual CRC projects' needs and demands. We aimed for simple and efficient solutions for secure data transfer between collaborators with controlled access, all while balancing ease of access for research. Each of these services and their underlying technology, properties (e.g., sharing possibilities, availability on HPC, backup, versioning, access), technological foundation, and usage scenarios are explained in the next section.

Data management plans for CRC projects
We designed 3 data management plan (DMP) templates after defining and categorizing the data management needs of each project depending on its experimental model type: human, animal, and human-animal tandem (see Supplementary Information 3.1, 3.2, and 3.3, respectively). The templates can also be found on Zenodo [92, 93]. DMP templates for animal, human, and tandem projects can differ depending on the scope of the project, the type of data collected, and how the data will be managed, but the major difference is the ethical considerations associated with each experimental model type. Animal projects, for instance, may necessitate additional safety protocols for the storage and management of animal tissue samples, whereas human projects require more stringent regulations and oversight for the ethical processing of human data [94]. Additionally, DMPs for human projects may include the collection of sensitive data (personally identifiable information), which must be securely stored and shared according to regulations. The DMP should adhere to GDPR-compliant guidelines for the handling of personal data. It should specify information and authentication measures for reuse and recovery, as well as deidentification procedures for datasets involving human participants before sharing. Additionally, the DMP should include details of the publication process for anonymized data. Human-animal tandem projects are more concerned with resource allocation and special protocols for the integration of different data types collected from multiple sources. It is important to ensure accuracy and consistency across all sources.

Figure 1: The Heidelberg Pain Consortium investigates humans, rodents, and tandem (human and rodent) models using various modalities: neuroimaging, neurophysiology, behavior, psychometrics, and genetics (A). Each data modality can be recorded using different techniques: MRI: magnetic resonance imaging; PET: positron emission tomography; fNIRS: functional near-infrared spectroscopy; NIBS: noninvasive brain stimulation; EEG: electroencephalography; MEG: magnetoencephalography; genetics; 2-photon imaging; behavior; EMA: ecological momentary assessment; psychometrics; physiology; VR: virtual reality; optogenetics (B).
The DMP states minimum requirements for metadata that must be provided for long-term preservation and secondary analysis of research data. It also contains information about data migration and access by third parties or future collaborators even after the project ends.
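Such minimum requirements can be enforced with a small validation step run before datasets are archived; in the following sketch, the required field list is hypothetical and stands in for whatever a project's DMP actually mandates:

```python
# Hypothetical minimum metadata fields; a real DMP would define its own list.
REQUIRED_FIELDS = {
    "project_id", "experimenter", "acquisition_date",
    "species", "modality", "license", "contact_email",
}

def check_minimum_metadata(metadata: dict) -> list:
    """Return the DMP-required fields that are missing or empty."""
    return sorted(
        field for field in REQUIRED_FIELDS
        if not metadata.get(field)
    )

record = {"project_id": "A01", "modality": "EEG", "license": "CC-BY-4.0"}
missing = check_minimum_metadata(record)
if missing:
    print("Archive blocked; missing metadata:", missing)
```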

Data storage solutions
Projects in the CRC generate massive volumes of data of various types and rely on data interoperability among labs. It is strongly advised that researchers securely store full datasets (e.g., raw and preprocessed data, analysis files, and code) associated with published findings and results, as this promotes the consortium's goal of further engaging in open science.
The CRC provides support for various storage solutions for all stages of the data life cycle, and the best option for a research project is determined by the type of data being collected, the size of the data, the security requirements, and the scalability of the platform. Factors such as interoperability with the existing infrastructure of the project group, reliability and accessibility of a particular storage solution, and availability of user training and service support are also taken into consideration. For example, if the research project involves collecting large amounts of data, then a cloud-based storage platform may be the best option. Additionally, cloud-based storage platforms provide access to data from anywhere with an Internet connection, making it easy for researchers to collaborate and share data with colleagues. If the data are sensitive, then a secure, on-premises storage solution may be the best choice because it can be customized to meet the specific security needs of the organization, such as encryption, authentication, and access control. Additionally, if the research project requires scalability (e.g., projects collecting terabytes of data from methods such as optogenetics, electrophysiology, and calcium imaging), then a platform that can easily scale up or down may be the best option. Ultimately, the best data storage platform for a research project will depend on the specific needs of the project, and university-approved data storage services are recommended to guarantee data privacy and confidentiality (see Fig. 2).
Most CRC researchers are encouraged to use our internal data storage platform SDS@hd (Scientific Data Storage, with a capacity of 20 petabytes), a central service for securely storing scientific data [95]. To facilitate easy data sharing between internal collaborators, all datasets collected from rodent experiments are shared across different research teams via SDS@hd, which stores experimental protocols, raw and preprocessed data, session information (e.g., session start time), mouse information (e.g., animal weight), task files (e.g., behavioral responses, videos, audio data), and metadata parameters (e.g., dimensions, pixel types, and instrumentation settings). This procedure is intended for data that are frequently accessed ("hot data"). A common storage space ("Speichervorhaben," meaning data storage project) is requested for collaborative projects or work groups to ensure that data are easily accessible to all project members with proper authentication (university credentials). The Heidelberg University network connects the consortium's labs and university departments, using a 10-Gbit network to expedite data transfer among institutes and facilities. Using SDS@hd drastically increases the ease of data access for shared projects and improves data safety by virtue of automated mirroring for data backup. Datasets collected from confocal or 2-photon microscopes are stored on acquisition computers temporarily (a few days), allowing researchers to quickly access, check, and transfer the data to a secure storage location on SDS@hd for further preprocessing and analysis.
Once the datasets are fully copied and backed up, individual users must ensure that the large datasets are removed from the acquisition computers in a timely manner (usually within a few days or a week, depending on data volume) to free up the acquisition storage system for new data or for another user. The reason to remove data from acquisition computers is that such computers are often not designed for long-term storage and can be vulnerable to hardware failures, data corruption, or security breaches. To mitigate these risks, it is common practice to copy the data immediately to the data storage platform to ensure long-term data preservation, security, and accessibility.
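Before deleting originals from an acquisition computer, it is prudent to verify that the copies are bit-identical. The following minimal sketch (Python standard library; both paths are hypothetical) compares SHA-256 checksums of the source file and its copy on the storage platform:

```python
import hashlib
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: local acquisition copy vs. the mounted storage share.
source = Path("D:/acquisition/session-0001.raw")
copy = Path("/mnt/sds-hd/project/session-0001.raw")

if sha256(source) == sha256(copy):
    print("Checksums match; the local copy can be safely removed.")
else:
    print("Mismatch; do NOT delete the original.")
```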
Apart from its storage and fast data transfer capacity, another reason for storing large datasets on SDS@hd is its direct access to other university platforms such as HPC systems (explained in the next section). Other university storage solutions that can be requested with support from the CRC data manager include SERVER BACKUP [96], a data storage service for servers (based on data protection and recovery software, i.e., IBM Spectrum Protect [ISP] [97]); a CLIENT BACKUP service [98], which can be accessed on all operating systems via the Duplicati software for secure backup of workstations and PCs; and HEIVOL-I [99], a service for creating network drives for university institutes and facilities. Similar services are available at the other participating CRC labs from other institutions.
For human data storage (including sensitive data), researchers use a storage server with restricted access (a Dell Isilon server with large storage and archival capacity). Designated personnel have the authority and responsibility to grant access to internal collaborators. When necessary, access can be given to external collaborators by assigning guest accounts with a data-sharing agreement in place.

Data processing, analysis, and visualization
Several projects and laboratories in the consortium use laboratory-based analysis infrastructure, such as local computers, shared analysis workstations (laboratory computers with GPUs and preinstalled acquisition and analysis tools that are shared by several members), and computational servers run by individual laboratories or groups.
The CRC highlights the importance of keeping track of every step, from initial data recording to analysis, and of proper documentation of analysis code, pipelines, and scripts. As a constructive starting point, the CRC 1158 data manager has set up a dedicated code space on GitHub [100], where multiple repositories with analysis code and scripts can be hosted and shared for each CRC 1158 project. The CRC 1158 data management organization repositories are maintained by the data manager, and access is given only to authorized project members.
In addition to local infrastructure and computing servers, university infrastructure is available for more demanding data-processing tasks, such as running computationally intensive analyses of heterogeneous and large-scale imaging datasets collected in human and rodent projects. Individual laboratories can apply for access to these services, with initial application and usage support from the data manager. A platform such as the one proposed here was necessary to seamlessly integrate data analysis with the setup and execution of preclinical experiments. Some of the CRC projects use bwForCluster MLS&WISO [101], and a detailed tutorial on access and use is available [102]. This eliminates administrative and technical barriers to performing computationally intensive tasks such as large-scale modeling, simulation, and analysis projects (e.g., Neuropixels systems). The HPC allows job scheduling using SLURM [103] and also supports reproducible computing environments (e.g., Docker [104], Singularity [105]) that can optionally run analysis modules on the HPC for particularly large datasets that are streamed directly to SDS@hd during acquisition, for example, chronic recordings with dense electrode arrays or image segmentation for chronic imaging (miniscope), with the perspective of running standardized analysis workflows. Allowing users to access large datasets stored on SDS@hd without downloading them to their local computer aids in the seamless integration of data analysis procedures, saving overall computational time and costs. The data can be accessed from the bwHPC cluster using the same protocols as for local storage, such as NFS, SMB, and FTP. This service also allows users to access data stored on SDS@hd from multiple bwHPC nodes simultaneously, thus increasing the overall speed of data access. By utilizing direct data access, scientists can take advantage of the increased computing power available in HPC systems while working on data stored in a scientific data storage system.
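To illustrate how such a containerized HPC job might look in practice, the following is a minimal sketch only: the resource requests, container image name (sorter.sif), analysis script, and SDS@hd mount path are hypothetical placeholders, not consortium defaults. A SLURM batch script is generated and queued from Python, with the data read directly from the shared storage mount rather than copied locally:

```python
import pathlib
import subprocess
import textwrap

# Minimal SLURM batch script; resource requests and all paths are
# illustrative placeholders, not consortium defaults.
job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=ephys-analysis
    #SBATCH --time=08:00:00
    #SBATCH --cpus-per-task=16
    #SBATCH --mem=64G

    # Data are read directly from the SDS@hd mount; no local copy needed.
    singularity exec sorter.sif \\
        python run_analysis.py /mnt/sds-hd/project01/session01
    """)

pathlib.Path("job.sh").write_text(job_script)
subprocess.run(["sbatch", "job.sh"], check=True)  # queue the job via SLURM
```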
While this article was being written, bwForCluster Helix [106], the successor of the current HPC system, was made available to users. The Helix component will enable seamless, cross-system workflows for processing and analyzing large amounts of data.
In our consortium's case, data management efforts have led to an increased user base for the HPC machines. Many factors play into a user's decision to use a particular HPC machine, such as its performance, cost, and availability. Data management efforts have made the HPC machines more attractive to users by providing individual support and training for access and use. The amount of training necessary to promote the use of HPC machines in a research lab depends on the lab's particular needs and existing infrastructure. Generally, training and seminars covered topics such as high-level usage of programming languages, high-performance computing paradigms, and best practices for using HPC machines for neuroscience data processing and analysis. They also included guidance on how to design and optimize applications for HPC systems (e.g., refactoring of tools for direct access or use on HPC and bwVISU). A lab might also need to provide additional training on data management and analysis or on specific software packages.
Similarly, for processing massive datasets (e.g., neuroimaging), some projects use heiCLOUD [107], an infrastructure-as-a-service cloud that provides virtual machines that can be customized and utilized as needed for the project. A possible scenario for heiCLOUD usage within our consortium is to install complex and computationally expensive software packages and perform concurrent processing of massive neuroimaging datasets. This provides powerful workstations for data analysis and can be especially useful for collaborative research projects that require data sharing between multiple research teams.
Another service that is frequently accessed by CRC members is heiBOX [108], a secure sync-and-share service hosted on heiCLOUD. This service is similar to commercial cloud storage services like Dropbox and Google Drive and allows users to save, synchronize, share, publish, and jointly edit files. heiBOX is based on the Seafile software [109] and allows users to search for text files, PDF files, and Office files in unencrypted libraries using full-text search. Additionally, Office files can be edited by multiple people, and files and folders can be tagged and commented on. Markdown documents can be used to create private or public wikis. heiBOX also provides backup, synchronization, and storage of small research data and document files, and guest accounts can be requested for data exchange with external collaborators. The most common use of heiBOX in our consortium is to share documents such as CRC meeting notes, data seminar and workshop presentations, individual project DMPs, manuscripts, and figures.
Another application that some of the CRC projects use is bwVISU [37], a remote visualization service for scientists (at universities in the state of Baden-Württemberg), together with the corresponding software stack to deploy such a service on premises. It has an interactive web front end that supports large-scale data analysis and visualization with little human intervention. Our RDM services also include technical assistance with refactoring lab-customized preprocessing and analysis pipelines (MATLAB and Python scripts) into more organized workflows/GUIs that can be run as HPC applications.

Metadata documentation and standardization
To standardize datasets generated within the collaborative studies across the consortium, we prioritized data documentation as a key first step. We encouraged consortium-wide adoption of ELNs to help researchers document experimental protocols in well-annotated electronic form at an early stage of the research project [110]. To select an appropriate option for our consortium projects while taking individual lab requirements into account, we tested multiple ELN options based on factors such as licensing, security, implementation and maintenance costs, ease of access and integration with existing resources, and domain-specific features that may be required. We selected elabFTW [111, 112].
The Competence Centre for Research Data (Kompetenzzentrum Forschungsdaten [KFD]), a joint institution run by the Library and Computing Centre of Heidelberg University, established a fully encrypted, web-based elabFTW instance [113] with secure cloud-based data storage. Our computing center's local installation of elabFTW, including secure cloud-based data storage, has proven beneficial to individual researchers in our consortium, serving as a central repository for shared experimental protocols. Transferring experimental protocols from traditional notebooks or digital documents to the ELN in a consistent manner required additional time and effort from project members. To overcome this potential burden, we created templates for the most common types of experiments performed in a single project (based on design protocols and biological methods). By developing such templates, we were able to guarantee the seamless integration of existing protocols into the ELN and ensure easy access to them at all stages of the experiment. Despite the initial investment, most labs found that the benefits of ELNs outweighed the additional burden by saving time and effort in the long run.
In our RDM framework, we use ELNs as a platform to document experimental protocols alongside minimal metadata generated automatically during an experiment. Most commonly, we use elabFTW, which accepts JavaScript Object Notation (JSON) files. elabFTW acts as a "notebook," tracking both primary data (experimental findings, measurements, etc.) and metadata (date, time, author, units, inventory used, etc.). The experimental metadata (e.g., microscope specifications, data acquisition settings) are stored in a standardized manner using a generic metadata file format that is compatible with open file formats, such as JSON or XML.
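As a minimal sketch of what such a machine-readable sidecar might look like (all field names and values below are illustrative examples, not a prescribed consortium schema), a session's acquisition settings can be serialized to JSON with a few lines of Python and attached to the corresponding elabFTW entry:

```python
import json

# Illustrative acquisition metadata for one imaging session;
# field names and values are hypothetical examples.
session_metadata = {
    "experimenter": "jane.doe",
    "session_start_time": "2023-03-01T09:30:00+01:00",
    "subject": {"species": "Mus musculus", "sex": "F", "weight_g": 24.5},
    "acquisition": {
        "microscope": "2-photon",
        "objective_magnification": 16,
        "frame_rate_hz": 30.0,
        "pixel_dimensions": [512, 512],
    },
}

# Write the sidecar file that accompanies the raw data and the ELN entry.
with open("session01_metadata.json", "w", encoding="utf-8") as f:
    json.dump(session_metadata, f, indent=2)
```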
However, ELNs are not usually designed as full-fledged metadata systems and may lack domain-specific functionality, such as support for file formats specific to neuroscience. Integrating metadata generated during the different stages of a neuroscience research project (such as experimental design, data acquisition and preprocessing, statistical analysis, visualization, and dissemination) into an ELN presents a significant challenge. Most ELNs do not offer sufficient support for the diverse data or metadata file formats generated during these intermediate steps, which makes incorporating additional metadata without an automated tool or API challenging and time-consuming.
In addition to the limitations of ELNs, many neuroscience datasets (e.g., electrophysiology experiments such as single-unit recordings or local field potential [LFP] recordings) lack a consistent metadata schema even at the most basic level. This means there is no standardized structure, format, or terminology for storing metadata such as experimental parameters (e.g., stimulus type, duration, intensity, and location), animal or subject information (e.g., species, age, sex, and weight), recording equipment and settings (e.g., amplifier type, sampling rate, and filtering), data-processing parameters, and analysis methods. To add, search, filter, or use various types of metadata effectively, specialized tools are often necessary. These tools are designed to handle the complexity and diversity of metadata formats and to streamline metadata management and integration with data analysis workflows. Without such tools, it can be challenging and time-consuming to work with large, complex datasets that contain multiple types of metadata.
For instance, when converting complete datasets, such as the raw, preprocessed, and analyzed files from an electrophysiology experiment, into a standardized open data format, the resulting basic metadata file is frequently insufficient and lacks important information such as analysis parameters, spike-sorting settings, or filtering parameters.
Therefore, we face the challenge of integrating comprehensive metadata, including experimental, acquisition, and analytical metadata, into a single file with a format compatible with open data standards. To address the challenges of inconsistent metadata schemas, a lack of tools for metadata management, and insufficient support for domain-specific file formats in ELNs, we developed custom tools (explained in the next section) that enable experiment- or modality-specific standardization and automate the data documentation process.

Data standardization in rodent projects
To devise practical solutions for harmonizing diverse neurophysiology datasets with diverse file formats, we divided our standardization approach into 3 main steps: (i) comprehensive metadata documentation, (ii) adoption of open data standards, and (iii) standardized preprocessing and analysis workflows in an HPC environment.
To ensure comprehensive metadata documentation and standardization of the neurophysiology datasets collected across the consortium's rodent projects, we collaborated with CatalystNeuro, a neuroscience software solutions company [114], to design a web-based metadata standardization GUI. The source code and installation guide are available [115]. The metadata-handling GUI allows standardized documentation of metadata (experimental information, acquisition and analytical parameters, etc.) collected from neurophysiology experiments [116].
The metadata-handling GUI generates JSON files based on an initial set of JSON schemas for different experiment types (e.g., extracellular electrophysiology and optical physiology). These schemas incorporate ontologies (structured and controlled vocabularies) to describe the data and associated metadata, including fields such as the type of population studied, the type of data collected, the date and duration of the experiment, and the equipment used. The GUI uses a centralized data dictionary that contains key metadata fields and possible values sourced from the consortium.
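To make the schema-plus-controlled-vocabulary idea concrete, the following sketch validates a metadata record against a deliberately small, hypothetical schema written in the spirit of the experiment-type schemas described above (the real schemas are maintained with the GUI [115]; all field names and values here are illustrative):

```python
from jsonschema import validate, ValidationError

# Hypothetical, much-reduced schema for illustration only.
ephys_schema = {
    "type": "object",
    "required": ["experiment_type", "species", "sampling_rate_hz"],
    "properties": {
        "experiment_type": {"enum": ["extracellular_electrophysiology",
                                     "optical_physiology"]},
        "species": {"type": "string"},
        "sampling_rate_hz": {"type": "number", "exclusiveMinimum": 0},
        "probe": {"type": "string"},
    },
}

record = {
    "experiment_type": "extracellular_electrophysiology",
    "species": "Mus musculus",
    "sampling_rate_hz": 30000,
    "probe": "Neuropixels 1.0",
}

try:
    validate(instance=record, schema=ephys_schema)  # raises on violations
    print("metadata record is valid")
except ValidationError as err:
    print(f"invalid metadata: {err.message}")
```

The enumerated values play the role of the centralized data dictionary: entries outside the controlled vocabulary are rejected at documentation time rather than discovered at analysis time.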
In addition to facilitating routine metadata entry through default field values, the GUI ensures that all necessary information is included and kept in a consistent format. It generates JSON files that incorporate standardized details of the experiment and analysis parameters, as well as raw file information. These files can be saved locally or centrally and imported into ELNs and open data repositories.
While eLabFTW is useful for recording experimental protocols, the GUI provides extra features such as automatic data validation and support for specific metadata formats. It can be easily customized to fit project requirements, facilitating more flexible and standardized metadata management across different platforms and tools. Overall, the GUI improves the efficiency and reliability of workflows within the lab and enables submission of complete datasets for archiving and future use.
Standardizing electrophysiology datasets recorded with Neuropixels probes involves the use of acquisition systems (e.g., the Neuralynx system [117]) and open-source acquisition software such as SpikeGLX [118] and Open Ephys [119]. Several electrophysiology projects within the consortium focus on using the Neurodata Without Borders (NWB) data standard [120]. The NWB 2.0 format is based on the Hierarchical Data Format version 5 (HDF5) and organizes files in a hierarchical structure that contains metadata, data, and processing code. It supports a wide range of data modalities, including electrophysiology (extracellular and intracellular recordings, electrocorticography) and optophysiology (2-photon imaging, fluorescent wide-field images, etc.). The NWB 2.0 format contains all the metadata required to specify the parameters of a neurophysiology experiment, such as a voltage trace having a sampling rate and being connected to specific electrodes, and the data can be shared between labs in a fully standardized format.
The NWB 2.0 data standard uses the JSON format for metadata files and a JSON schema to define the metadata structure and validate the metadata content. This ensures that the metadata conform to a consistent format and contain all the necessary information for data sharing and reuse. The JSON metadata files in NWB 2.0 are typically associated with HDF5 files that store the actual data, enabling efficient storage, querying, and analysis of large neuroscience datasets.
Our metadata-handling GUI produces JSON files based on JSON schemas that follow the same NWB 2.0 standard. This enables easy conversion of datasets into the NWB format and simplifies the process of sharing and using data across the consortium's research groups and projects. Moreover, our GUI's ability to generate JSON files in the same format as NWB 2.0 allows for easy data conversion between different formats, further facilitating the sharing and use of data across the different research labs within the consortium.
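To illustrate how such GUI-generated JSON metadata can seed an NWB file, the following minimal sketch uses PyNWB, the reference Python API for NWB 2.0. The sidecar filename, metadata fields, and fallback values are illustrative; a real conversion would also append the recorded data series before writing:

```python
import json
from datetime import datetime
from dateutil.tz import tzlocal
from pynwb import NWBFile, NWBHDF5IO

# Load a (hypothetical) metadata sidecar produced by the metadata GUI.
with open("session01_metadata.json", encoding="utf-8") as f:
    meta = json.load(f)

# Create the NWB container from the standardized metadata.
nwbfile = NWBFile(
    session_description=meta.get("session_description", "behavior session"),
    identifier=meta.get("identifier", "session01"),
    session_start_time=datetime.fromisoformat(
        meta.get("session_start_time", datetime.now(tzlocal()).isoformat())
    ),
    lab=meta.get("lab", "example lab"),
    institution=meta.get("institution", "Heidelberg University"),
)

# Acquisition and analysis objects (electrode tables, TimeSeries,
# sorted units, ...) would be added here before writing.
with NWBHDF5IO("session01.nwb", mode="w") as io:
    io.write(nwbfile)
```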
Another useful strategy was the use of open-source analysis and visualization software packages such as SpikeInterface [121, 122], which supports the import and export of data in NWB format. It elegantly solves the problem of importing hardware-specific acquisition formats into a common environment while also providing preprocessing capabilities and streamlined access to a variety of spike-sorting algorithms. The best way to ensure accuracy and reliability in the results from different spike sorters applied to Neuropixels probe data is to use a consistent, well-defined analysis workflow. Using HPC clusters and remote visualization platforms (e.g., bwVISU), it was possible to overcome the challenge of real-time processing of large electrophysiology datasets from multiple recordings. Running similar preprocessing and analysis workflows in a single HPC environment allows for efficient data processing and ensures that the results are consistent and reproducible.
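A minimal sketch of such a shared SpikeInterface workflow follows; the recording path, filter settings, and choice of sorter are illustrative (the sorter must be installed separately, and argument names can vary slightly across SpikeInterface versions):

```python
import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre
import spikeinterface.sorters as ss

# Load a (hypothetical) SpikeGLX recording folder from shared storage.
recording = se.read_spikeglx("/mnt/sds-hd/project01/session01_g0")

# Apply the same preprocessing chain to every session.
recording = spre.bandpass_filter(recording, freq_min=300, freq_max=6000)
recording = spre.common_reference(recording, operator="median")

# Run a spike sorter through SpikeInterface's uniform interface;
# "kilosort2_5" is one example of a supported sorter.
sorting = ss.run_sorter(sorter_name="kilosort2_5", recording=recording)
print(sorting)
```

Because the importer, preprocessing, and sorter are all addressed through one API, the identical script can be rerun on the HPC cluster against any session, which is what makes the cross-lab results comparable.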
Additional assessment techniques in animals include EEG, MRI, and PET. The datasets generated by microscopic imaging techniques and by a variety of acquisition devices, such as the repetitive in vivo multiphoton imaging experiments conducted over a 20- to 24-week period in living mice, are challenging to standardize. Researchers collect and view microscopic imaging data from different vendor-specific acquisition software in diverse file formats (e.g., TIFF, multipage TIFF, Nikon ND2, Leica LIF, Zeiss CZI, or ZVI). It is often difficult to read the metadata from these files in other software. While TIFF is the most commonly used file format because it is easily accessible to many current analysis software platforms, it has limitations such as long latency and delayed data access when working with large batches of files [123].
Our standardization strategy focuses primarily on tools that are interoperable with preexisting services, such as the local data storage platforms used for storing imaging datasets, the bioimaging software applications (ImageJ/Fiji, etc.) [85, 124] used for data analysis, the type of ELNs adopted within the consortium labs, and the HPC clusters. The objective is to achieve a feasible level of automatic interoperability with existing data analysis and visualization tools as well as ELNs. We utilize the existing bioimaging application Fiji [125], which supports the import and export of multiple imaging acquisition file formats. Fiji also allows automated extraction and display of metadata from raw files (e.g., .TIFF, .LIF) using its Bio-Formats plugin [126], including the Bio-Formats Importer and Exporter, Bio-Formats Macro Extensions, Data Browser, and so on. We have not yet settled on a single data standard across the imaging projects, but evaluating these community-proposed formats will enable us to implement solutions that allow users to link the experimental metadata (design protocols, biological methods, etc.) with microscope specifications, image acquisition settings, and analysis workflows in a more comprehensive metadata file (JSON or OME-XML format) [127]. We are currently setting up a modular pipeline for exporting the data into more open-source and standardized formats, such as the next-generation file format OME-NGFF [128] and Microscopy-BIDS (an extension of BIDS for microscopic imaging data) [129]. Our standardization pipeline will also allow the incorporation of missing metadata values and will allow users to create additional fields to support the other diverse acquired file formats.
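As a minimal sketch of what one such export step could look like, the snippet below writes an image stack to OME-NGFF (OME-Zarr) using the ome-zarr-py reference library; the array here is a random stand-in for data that would in practice be read from a vendor file via Bio-Formats, and the output path is illustrative:

```python
import numpy as np
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

# Stand-in for an imaging stack with (t, c, z, y, x) axes; in practice
# this array would be loaded from a vendor file via Bio-Formats.
stack = np.random.randint(0, 2**16, size=(1, 2, 16, 512, 512),
                          dtype=np.uint16)

# Write a multiscale OME-NGFF (OME-Zarr) group with explicit axis metadata.
store = parse_url("session01.zarr", mode="w").store
root = zarr.group(store=store)
write_image(image=stack, group=root, axes="tczyx")
```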

Data standardization in human projects
All human research projects acquire multimodal data (neuroimaging, neurophysiological, behavioral, psychometric, etc.). As part of the plan to implement good data management practices, we followed recent developments in data standards and methodologies to make the data interoperable. For this purpose, we created standard protocols for each data type, along with the associated metadata.
For instance, to standardize MRI data acquisitions, we developed magnetic resonance (MR) acquisition protocols for anatomical (e.g., T1-weighted images) and functional scans (e.g., echo planar imaging [EPI]) in terms of image resolution, type of acquisition, and duration. The use of the same acquisition parameters with consistent terminology across studies allowed researchers to use similar preprocessing and analysis pipelines, promoting efficiency and reproducibility. Such homogeneity in data acquisition also enables the pooling of data across projects, which is particularly suitable for increasing sample size or for comparison purposes. For example, we can directly compare the structure and function of various patient populations recruited in different projects (e.g., to examine commonalities and differences between individuals with chronic back pain and fibromyalgia patients).
Our goals of data integration and homogeneity were facilitated by the recent opening of the Center for Innovative Psychiatric and Psychotherapeutic Research (CIPP) [130], an extensive, modern research infrastructure with access to neuroimaging, pharmacological, and psychotherapeutic techniques. In this center, researchers share laboratories and equipment, which allows the collection of homogeneous data types and data formats for behavioral (e.g., motor) or sensory (e.g., quantitative sensory) testing. In addition, we set up a core set of standardized assessments (e.g., motor paradigms, the use of electronic diaries for pain assessments, quantitative sensory testing, stress-induced analgesia) and psychological questionnaires (e.g., the Hospital Anxiety and Depression Scale [HADS] and the Multidimensional Pain Inventory [MPI] [131, 132]) to be used across all relevant studies.
We have made significant progress in clinical projects involving human studies by using the BIDS data standard [17] for the anonymization, organization, and annotation of neuroimaging and behavioral data [18, 133]. BIDS also includes support for other multimodal data, longitudinal and multisession studies, and physiological metadata collected during MRI experiments.
For example, a typical MR brain acquisition includes anatomical scans (e.g., T1-weighted images) and functional scans (e.g., EPI), which have predefined directory labels in the BIDS nomenclature: "anat" and "func," respectively. In BIDS, metadata fields common to all subjects are specified in a single JSON file in the root directory instead of in multiple files repeated for each subject. Organizing the data according to the BIDS standard ensures that the metadata are automatically included in the metadata file, eliminating the need for manual input and saving time and effort when processing large amounts of data. Moreover, the adoption of BIDS enabled the development of workflows for automated data extraction, curation, and labeling. For example, automatic extraction of a minimal set of BIDS-compatible metadata can be performed using dcm2niix [134].
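For readers unfamiliar with the layout, a minimal BIDS dataset with a single subject might look as follows (the subject and task names are illustrative; the root-level JSON holds the functional metadata shared by all subjects):

```
dataset/
├── dataset_description.json      # dataset-level metadata
├── participants.tsv              # one row per subject
├── task-rest_bold.json           # functional metadata shared by all subjects
└── sub-01/
    ├── anat/
    │   └── sub-01_T1w.nii.gz
    └── func/
        └── sub-01_task-rest_bold.nii.gz
```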
Regarding data storage, a secure storage server is used for anonymized data in accordance with accepted ethical and quality standards to maintain data protection and privacy. The original sensitive data are stored separately with restricted access to reduce the risk of disclosure or unauthorized access. The anonymized data to be analyzed are then uploaded to the laboratory server. The server is used as a shared infrastructure where a set of open-source software (e.g., FreeSurfer [135], FSL [136], fMRIPrep [137], QSIprep [138]) is installed for data preprocessing and analysis. We are using custom scripts for the anonymization of MRI datasets. Currently, these scripts are written in MATLAB, but we envision developing a modular automated tool with an interactive GUI. The custom code used for anonymization, preprocessing, and analysis is available online via GitHub and released under the BSD license [139]. The resulting datasets in BIDS format can also be validated using the BIDS-Validator (open-source code available on GitHub [140] and as an online tool [141]). After anonymization and quality control, the datasets are available for sharing within and outside of the laboratory. The data can also be made publicly available with proper security measures.
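A minimal sketch of the conversion and validation steps might look as follows, assuming dcm2niix [134] and the bids-validator command-line tool [140, 141] are installed; all paths and the output filename pattern are illustrative placeholders:

```python
import subprocess

# Convert a DICOM series to gzipped NIfTI plus a BIDS JSON sidecar.
subprocess.run(
    [
        "dcm2niix",
        "-b", "y",                 # write the BIDS sidecar JSON
        "-z", "y",                 # gzip the NIfTI output
        "-f", "sub-01_T1w",        # output filename (hypothetical)
        "-o", "bids_dataset/sub-01/anat",
        "dicom/sub-01/t1/",
    ],
    check=True,
)

# Check the assembled dataset against the BIDS specification.
subprocess.run(["bids-validator", "bids_dataset"], check=True)
```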

Behavioral data standardization
To facilitate some level of behavioral data harmonization within the CRC, we have adopted a very simple and intuitive approach. Numerous past studies have shown that adopting standard operating procedures (SOPs) and standardizing experimental conditions across labs in multisite, large-scale projects leads to more accurate and reproducible results [142-146]. All rodent projects adhere to standardized experimental procedures for behavioral assays. We strongly encourage each project to share its SOPs, experimental site conditions, hardware, software (including acquisition software), and preprocessing and analysis pipelines. To ensure consistency, we have provided SOPs to control variables such as mouse strain, age, and weight range. We have adopted a simple approach for storing metadata for behavioral datasets. For projects that combine behavioral data with any other type of experimental data (e.g., electrophysiological recordings or neuroimaging), we prioritize the use of the same metadata file format (e.g., JSON, XML) that is incorporated in the data standard adopted for the other data types (e.g., NWB, BIDS). This helps to define parameters for behavioral paradigms and facilitates the integration of datasets. Although this approach is not fully automated, it does provide an initial level of data documentation, which will help to promote further standardization [69].
The acquisition of standard human behavioral data has been facilitated by a service project aimed at training researchers and homogenizing acquisition protocols. Moreover, the CIPP infrastructure enabled researchers to collect homogeneous data, acquired using similar equipment and resulting in similar formats. For example, quantitative sensory testing experiments have been standardized across projects in terms of measured variables, with output saved as .csv files. However, some projects collect additional data that are specific to their patient population and therefore have no standards yet (e.g., defining body markers that trigger referred sensations or defining the modality [e.g., sensory, motor] that evokes phantom pain in amputees). Additionally, the frame rates and resolutions of videos recording tracking data during virtual reality experiments are project specific and should be documented.

Data dissemination in CRC
Our consortium's collaborations with national and international neuroscience initiatives such as EBRAINS [147] and the NFDI bioimaging initiative in Germany, NFDI4BIOIMAGE [148], promote data sharing and encourage all projects to share large datasets (such as electrophysiological datasets from Neuropixels probes and cellular imaging datasets from preclinical projects) with external collaborators from the international community. Similarly, data harmonization efforts significantly aid the sharing of large human imaging datasets in open data repositories.
The consortium's data policy encourages the submission of published datasets to repositories and the publication of open-access articles. Unless a dataset is specifically exempted, all consortium research data must be made available via a suitable data publishing or archiving platform under appropriate authorization and licensing (for example, a Creative Commons or Open Source Initiative-approved [software] license) to allow for flexible public reuse. Any third-party data gathered by or provided for consortium research activities are equally subject to these standards, unless data use agreements clearly restrict it. We assist consortium members in archiving and publishing data on heiDATA [149], an institutional repository for research data based on the Dataverse Project [150]. This repository supports data documentation as well as administrative, technical, and descriptive metadata; each dataset is given a persistent identifier, a citable address, and a DataCite ID [151]. In addition to data publication, the heiDATA repository allows data access via a simple interface. This provides for the permanent publication of data records in the repository while also providing a separate interface for regular data access. Complete datasets (research data, code, documentation, and metadata) are collected in a dataverse established for CRC 1158 projects [152].
The Research Data Competence Center (KFD) also provides specific guidelines and procedures on data repositories, archiving, licensing, and access restrictions in order to provide public access to these datasets. The KFD is currently developing heiARCHIVE [153], a digital long-term archive for research data preservation, which will become available to the CRC 1158 during the next funding period or near the end of the current funding period [154]. This service will provide researchers with an easy-to-use end-user platform for archiving their research data (for at least 10 years), as well as the option of performing open archival information system (OAIS)-compatible long-term preservation with features such as format recognition, validation, and file conversion of appropriate file formats.

Discussion
Open science and data sharing are increasingly promoted by funding organizations and research groups. However, individual scientists often find it challenging to prioritize FAIR procedures amid competing research needs. In practice, applying FAIR standards places substantial demands on researchers, many of whom are under immense time pressure to deliver outputs and may lack practical or conceptual RDM expertise. Implementing effective RDM strategies has the potential to improve the efficiency and accuracy of research and to reduce the amount of time individual researchers spend on data management [155].
Knowledge transfer, data reuse, and sharing may reduce redundant research [156, 157]. One of the goals of data management is to make study results freely available through open-access publishing. Investing in collaborative projects with long-term goals ensures that the data are organized in a way that makes them easily accessible and retrievable for future use. This makes it easier and faster to develop new research projects, as well as to replicate or build on existing studies, which should have a direct impact on public research funding. Effective data management strategies can indeed optimize the use of public research funding by pooling resources and infrastructure from multiple sources and bringing together experts from universities, research institutes, and other community organizations to work on long-term interdisciplinary projects. Furthermore, by ensuring proper RDM, researchers should be able to reduce animal use. For example, making informed decisions about which animal models to use for their studies should enable them to use the same animals for multiple experiments, instead of continuously using new animals for each study. In addition, sharing previously acquired data with adequate metadata or reusing control group data from similar studies can avoid repeating in vivo work [158-161].
We have learned from our own experience of implementing RDM activities across the consortium that many projects do not fully realize the benefits of the available infrastructure and resources. This is partly due to uncertainties regarding organizational and technical requirements and a lack of knowledge of existing resources. Therefore, we place great emphasis on promoting and encouraging the use of available generic tools and infrastructures whenever possible. Our RDM implementation strategy is based on the flexible and easy integration of existing, maximally generic components to support researchers in implementing specific solutions for data collection, processing, analysis, storage, and publication and, when appropriate, on the development of sustainable, project-specific infrastructures, such as data and metadata standardization tools.
The central strategy of the consortium involves coordinated cross-species analyses in experimental animal models and in human subjects. To achieve this, we use multiscale imaging; electrophysiological, psychometric, and behavioral readouts; and a range of interventional strategies across both rodent and human populations. Our consortium involves translational research projects, with the goal of translating animal research into human applications or basic science into treatments and therapies that benefit patients. To ensure the long-term preservation of valuable datasets, we developed a CRC RDM policy (see the funding information section and Supplementary Information 1) in accordance with the German Research Foundation (Deutsche Forschungsgemeinschaft [DFG]) guidelines for research data handling [162]. The CRC data policy serves as a recommended guideline for individual projects on how to format their data. The acquisition of heterogeneous data in multiple projects can make data formatting challenging, and the data policy guidelines do not specify a level of granularity for data formatting. However, they do provide general recommendations for formatting such diverse datasets, such as using standard data formats and tagging data with descriptive metadata. To support individual research groups, we provide resources and funds for the implementation of modern tools and infrastructure that are compatible with community RDM standards. For instance, we recommend using "standard" formats for data of similar modalities (e.g., neuroimaging data [MRI] formatted to NIfTI [Neuroimaging Informatics Technology Initiative], electrophysiology data formatted to NWB), as described in the data standardization sections.
Our data policy emphasizes the significance of data documentation and sharing, while also promoting the use of open-access repositories to facilitate data sharing. In addition, our RDM services provide general support for research groups, such as assistance with deploying cloud-based data storage solutions (e.g., Amazon S3, Google Cloud Storage, and Microsoft Azure), establishing data governance policies, using automated software to streamline data storage and retrieval, and educating researchers about data privacy and GDPR regulations, among others.

Data stewards, community engagement, and collaborations
To effectively implement data policy and governance procedures in large consortia, the presence of data managers and stewards is crucial [163]. These roles are ideally suited for individuals with a background in research, computer science, or bioinformatics and strong communication skills. Additionally, depending on the consortium's needs, candidates should have experience in developing high-throughput analysis pipelines, domain-specific data structures and standards, open-access publishing, modern data science approaches, high-performance computing environments, cloud computing, data security, and databases, among other things. The data manager's diverse role involves working closely with core computing and library resources to streamline access to the consortium's common research data infrastructure, which is available at the host institutions of the participating labs.
Data managers are responsible for supporting ongoing research, providing guidance on best practices for data handling, and keeping up to date with the latest developments in the RDM field. They act as a vital link between consortium researchers, collaborators, the university's RDM planning group and computing center, and community organizations. By bridging the gap between lab-based scientists and the available technical infrastructure and services, they provide direct assistance in daily tasks such as data organization, tool selection, workflow development, and standardization, which benefits individual researchers and research groups. Addressing data management tasks early in the research timeline is essential to making the research process more efficient and ensuring the interoperability and reusability of datasets. Expert guidance on existing infrastructure and resources, such as scientific repositories and databases, and on legal and ethical issues is also necessary to promote an effective data-sharing strategy.
In our consortium, the data manager maintains consistent communication with various research groups and other consortia to establish a community network and links to other scientific communities, such as the National Research Data Infrastructure (NFDI) consortium. This ensures the dissemination of data throughout the community and the development of data management techniques that are specifically designed to facilitate neuroscience research. Our consortium actively engages in numerous international and national RDM initiatives, including NFDI4BIOIMAGE and EBRAINS, which promote the development of high-level infrastructures and services across various scientific disciplines. Our active engagement in diverse task areas of these community-led initiatives, including Neuromorphic Computing, Data Analytics, Workflows, GDPR, The Virtual Brain Cloud, and so on, is instrumental in supporting the development of a sustainable and community-oriented RDM strategy.
To ensure the harmonization of our RDM efforts, our CRC follows the recommendations of the International Neuroinformatics Coordinating Facility [164, 165] and employs community-developed standards that have gained international recognition for neurophysiology and neuroimaging datasets, such as BIDS and NWB (see the list in [166]). Moreover, we utilize resources like FAIRsharing [167, 168] and the UK Digital Curation Centre [169] to obtain a comparative overview of data and metadata standards. In addition, we participate in the Research Data Alliance [170] and European Open Science Cloud [171, 172] initiatives to adopt and develop novel resources for open data exchange across technologies and scientific disciplines.
We also focused on developing RDM strategies that include joint efforts and cooperation between consortium members and other large-scale consortia and collaborative centers within Germany. We acknowledge that data organization across different projects is a common issue in collaborative centers. Joint efforts were made to develop a data organization strategy that works for most of the consortium's projects working in similar research areas. Our main goal was to engage more directly with several overlooked aspects of managing data in a large collaborative consortium while keeping the global neuroscience community in mind.
We recommended that CRC 1158 project members utilize logical file and folder templates to support systematic data organization. Consistent folder organization depends on the type of research data acquired in a project as well as on its governance procedures. Our goal is to provide researchers with an easy way to manage their projects' digital files and datasets on different data infrastructure services, both locally and on subject-specific data repositories such as GIN (a modern research data management system for neuroscience). For this purpose, we use folder structure templates for research repositories developed in collaboration with the NFDI Neuroscience (NFDI-Neuro) consortium (currently nonfunded) and 3 neuroscience CRCs (CRC 1158, CRC 1315, and CRC/TRR 135) [173]. The template structure is available on Zenodo [92, 173].
These templates are designed to reflect the typical workflow of a research project. This means that the structure is organized in a way that makes it easy to track the different stages of data acquisition, processing, and analysis. Depending on the specific needs of the CRC projects, we customize the templates based on the type of experiment or data modality as well as on the analysis processes that should be integrated with existing data organization systems. The template structure includes separate sections for raw data and analyzed data, as well as for documentation and code related to each stage of the workflow. To illustrate, neuroimaging datasets such as MEG or fMRI data in BIDS format can be efficiently organized and stored in the "03_data" directory: the raw data (BIDS raw, e.g., NIfTI and JSON) can be stored in the subfolder "001_defaultexp" (default experimental data), and the analyzed data (BIDS derivatives, e.g., fMRIPrep, SPM, FSL, FreeSurfer, and QSIprep outputs) can be stored in the subfolder "999_processed_data." It is recommended to add the workflows and code libraries used for data analysis to the designated analysis directory (i.e., "04_data_analysis"). These folder structure templates facilitate reproducibility and data sharing, and they can be used on different storage devices to accommodate the various datasets generated during experiments, independent of their format.
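Condensing these conventions into a sketch (only the directories named above are shown; the full template on Zenodo [92, 173] contains further sections):

```
project/
├── 03_data/
│   ├── 001_defaultexp/        # raw data, e.g., BIDS raw (NIfTI + JSON)
│   └── 999_processed_data/    # analyzed data, e.g., BIDS derivatives
└── 04_data_analysis/          # workflows and code libraries used for analysis
```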

Sharing sensitive data from human projects
The sharing of human data gathered from clinical or nonclinical populations in neuroscience research is essential for advancing science and producing important public health benefits. However, a clear set of regulations and guidelines must be established before such data are shared. Specific rules addressing privacy issues, established processes for data protection, data use and reuse, and the preservation of sensitive data are required. It is essential to make data accessible and understandable to remote (or future) collaborators in order to maximize the potential of existing algorithms and tools and to accelerate the creation of new ones. Regulations and guidelines should ensure that the data are used for the purpose for which they were gathered and protect the rights of research participants. These guidelines should also cover how data are collected, stored, shared, and destroyed; specify the types of data that must be kept confidential; and define appropriate methods for handling and safeguarding the data. Additionally, regulations should ensure that the data are secure, kept confidential, and not used for marketing or other commercial purposes, and that they are used responsibly and not used to discriminate against people with disabilities or other vulnerable populations. Ethical rules for the reuse and sharing of data should be based on the principle of informed consent. This includes obtaining consent from the original data collectors or from research participants, as well as obtaining permission from any third parties involved in the data collection. Researchers should also seek to minimize the risk of data misuse or breach of confidentiality.
Additionally, there is a lack of efficient software programs that can adequately segregate and maintain control over sensitive data. Developing effective software that is secure, user-friendly, and cost-effective is difficult. Maintaining such software requires a significant investment of resources, and funding for such measures is often lacking. Furthermore, adopting such software requires an investment in training and resources that many organizations may be unwilling to make.
The legal and ethical requirements surrounding the use of sensitive data are often complex and difficult to understand, leading to confusion and ambiguity about the best way to protect them. It is important to provide researchers working with sensitive data with truly "useful" tools that do not require preexisting, in-depth knowledge of legal and ethical requirements or the time to delve into the details. Such tools are essential to ensure that sensitive data are protected and securely stored.
The use of such tools can help researchers make informed decisions about how best to use and manage sensitive data, allowing them to work with it in an ethical and responsible manner. Finally, these tools can help to reduce the risk of data breaches and data misuse, which can have serious consequences for the people and organizations whose data are affected. With such tools in place, researchers can focus on their research rather than on legal and ethical considerations, saving time and resources.
Several software tools can be used to maintain sensitive patient data in neuroscience research, such as the web-based platform REDCap (Research Electronic Data Capture) [174], the open-source imaging informatics platform XNAT (XNAT Central) [175], and LORIS (Longitudinal Online Research and Imaging System) [176]. It is important to note that the security features of these tools may vary and should be evaluated before use. In addition to software tools, secure data storage and access protocols should also be in place to ensure that sensitive patient data are protected.
Currently, we are expanding our collaborative efforts by creating a data infrastructure platform that will establish a GDPR-compliant data registry called the PainReg registry. It is based on the Germany-wide ParaReg registry [177, 178] for human volunteers. To facilitate cross-project data merging, a core clinical dataset will be defined, assigning each study participant a unique identifier that is shared by all projects. This allows researchers to determine whether a volunteer has participated in multiple projects, thereby helping to reduce redundant data acquisition. This can result in cost and time savings, as well as increased data collection accuracy. For example, we found that the same study participant could be tested twice and assigned different IDs belonging to different projects, resulting in redundant data acquisition and unnecessarily increased costs, particularly for genetic analysis.
Furthermore, the data registry will ensure that data privacy regulations are strictly followed by obtaining participants' consent to access data for secondary or follow-up studies. It will also include an identity management feature to limit access to authorized users. The registry will contain a wide range of data, including brain imaging, genetic, cognitive, and physiological data. This collaborative work will be coordinated by the consortium's future data infrastructure project, which will be tasked with implementing, testing, optimizing, and standardizing the data analysis procedures and models to be utilized in all projects.

Data integration
For collaborative research, data integration and standardization are crucial for interoperability and data sharing [179], but they can be quite challenging to implement given the wide range of methodologies represented in the consortium. Early standardization of data can have massive benefits for data integration in collaborative projects. This can be achieved by streamlining the use of tools for more replicable and reproducible analysis. The data integration process often depends on the individual projects and their underlying workflows and processes. The degree to which integration is possible depends on the modalities used, the subject population, and the experimental design. Researchers may also integrate the raw data collected from each partner into a core dataset. Integrated datasets can provide a more comprehensive understanding of the research question and allow researchers to compare the results of their analyses more directly. Depending on the modalities used, the data may need to be transformed or normalized before integration, and the analysis techniques may need to be adapted to the combined dataset.
For some projects, researchers can even analyze data from their partners and vice versa. In some human projects, fMRI data from one group and EEG data from another group are combined to gain a better understanding of how the 2 modalities interact. This may involve combining datasets or running analyses on the combined dataset to identify common patterns or trends. However, this process requires careful consideration of the data sources, data formats, and analysis techniques used by individual labs, as well as the selected methods for data fusion and data mining. At the most basic level, researchers can compare the data collected by each partner to identify commonalities and differences. This could include comparing the number and types of modalities used, the subject population, the experimental design, and the type of analysis performed. For example, they could investigate how brain structure (e.g., gray matter volume, cortical thickness) relates to behavior. There are also association studies aiming to compare brain activity between 2 groups of participants (e.g., healthy individuals and chronic pain patients) to explore neural differences in cognition or behavior. They could also examine associations between neural activity in different brain areas and physiological responses of the subject.
Furthermore, researchers can use deep learning algorithms to identify patterns in the data and gain insights that enhance our understanding of the brain. For instance, deep learning algorithms can detect patterns in EEG data to identify different states of consciousness or seizures. Additionally, artificial intelligence techniques can be employed to combine multiple datasets to gain a better understanding of the complex relationship between brain and behavior.
From our own experience, we have realized that a systematic effort to develop standardized guidelines for multimodal data acquisition would strongly facilitate the data integration process and promote the adoption of FAIR data standards across all studies. Our CRC is developing a multimodal digital intervention platform that aims to combine data collected by the CRC projects for further analyses. This platform benefits from an increased sample size, which should result in improved prediction accuracy, with the potential to optimize therapies [180], such as the use of invasive or noninvasive neurostimulation. Our current efforts in harmonizing and standardizing datasets (e.g., using BIDS) and preprocessing approaches will also facilitate this development and further improve data analysis.

Data standardization
We have devised a set of strategies to ensure that datasets can be thoroughly documented and converted into open data standards with a minimum amount of effort. The consortium's projects combine datasets from electrophysiological recordings, optogenetic manipulations, rodent behavioral assays such as sensory testing (von Frey filaments), the cold plate test, and the open field test [181, 182], 2-photon in vivo imaging, and MRI, resulting in a plethora of disparate file formats and unorganized metadata. These datasets are saved in a variety of file formats, including video files (.avi), original raw ASCII log files, and text-based file formats (.csv), among others.

Adoption of project-specific DMPs
Funding agencies and research organizations increasingly request DMPs with grant applications. The obligation to submit a DMP depends on the requirements of the funding organization. DMPs should be created early on, ideally when applying for funding or at the beginning of a research project, and updated as needed. For example, European Research Council (ERC)-funded projects that participate in the Horizon 2020 Open Research Data pilot are required to submit the first version of their DMP within 6 months after the start of the grant. Open-access publications are encouraged, and grantees should demonstrate FAIR-compliant data management and resource use. However, some research studies involving sensitive data are exempt from these requirements.
The DMP developed for each research project highlights relevant information about the research data and associated metadata that is required for the reproducibility of research results. Preliminary versions of DMPs can assist participating labs in making informed decisions about their data management resource requirements (financial support or personnel). DMPonline [213] and RDMO [214] are 2 open-source software solutions for creating custom DMPs [215, 216]. Several DMP templates have already been made available in response to funding agency criteria [217].
Individual project DMPs can be created using these templates, or, if a dataset requires particular RDM resources, a dataset-specific DMP can be created. These DMP templates cover questions about how data are handled at each stage of the project, including a general project description, experimental and dataset descriptions, specific data documentation (types of data and experimental models, methods for acquisition and collection, questionnaires, and analysis software), decisions on data and metadata standards and formats, and proposed plans for organization, access, sharing, short- and long-term storage, reuse, and implementation costs. The document, once prepared, explains the management of the research data acquired, reviewed, and processed as part of the CRC 1158 initiatives. The template includes generic questions regarding best practices for each stage of the data management life cycle that may be answered early in the project, while domain-specific questions can be answered later.

Data storage, organization, and sharing
In addition to providing support for access to and use of internal university resources for data storage and sharing, the CRC also supports the adoption of innovative community-developed solutions. Versioning of datasets, along with software and code, becomes critical for such projects as data files and metadata are updated over time. Even for complete datasets published or submitted to a repository, versioning helps track changes in the data files or metadata that are incorporated after data reuse or reanalysis. For instance, since most CRC projects run for multiple funding periods and involve extended analyses, data versioning becomes even more crucial. Researchers often modify, refine, or add to their datasets during the course of their research. Without proper versioning, it may not be possible to reproduce previous findings, which can negatively impact the credibility of the research outcomes.
This includes, for example, comparisons of data from various groups of pain patients collected in different funding periods, associations between various data modalities (e.g., data collected using electroencephalography in the first funding period and fMRI data in the next funding period), or simply comparisons between various analysis toolboxes (e.g., FSL vs. SPM).
Platforms such as DataLad [200] and GIN [202] can effectively compensate for a lack of local resources. Data hosting and sharing can also ensure data versioning and encourage reproducible management of scientific data. Both DataLad and GIN are based on git and git-annex and provide a decentralized system for the exchange of large datasets. DataLad is an open-source software package for the management of distributed datasets. It facilitates the acquisition, organization, and management of data stored in remote repositories but does not itself offer storage. Moreover, the GIN service can be deployed locally at all participating labs and can be used as an in-house storage server and web user interface for DataLad datasets. Datasets hosted on either platform can be accessed via git-compatible systems. Other examples of resources supporting collaborative workflow development and the integration of data hosting with processing/analysis computing resources include the Open Science Framework [218] and the Open Science Grid [219].
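A minimal sketch of this versioning workflow, using DataLad's Python API (the dataset name and the GIN-style URL below are placeholders, not real consortium locations):

```python
import datalad.api as dl

# Create a new version-controlled dataset; large files are tracked
# by git-annex under the hood.
ds = dl.create(path="pain-ephys-dataset")

# ... copy raw recordings into the dataset directory, then record
# a new version with a meaningful message.
ds.save(message="Add session 01 raw recordings")

# Collaborators can clone from a hosting service such as a GIN
# instance (placeholder URL) and fetch file content on demand
# instead of downloading everything up front.
clone = dl.clone(source="https://gin.example.org/lab/pain-ephys-dataset",
                 path="local-copy")
dl.get("local-copy/sub-01", dataset=clone)
```

Because every `save` creates a new recorded state, earlier versions of a dataset remain retrievable even after files are refined or reanalyzed, which is what makes previous findings reproducible across funding periods.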

Data analysis and visualization
We are currently refactoring and developing image analysis tools for the automated running of deep learning applications on bwVISU [220]. With such extra computational resources, it is possible to set up automated analysis workflows on HPC that could allow for faster, more accurate diagnoses in near-real time. The goal of this project is to implement a deep learning API for image data processing and to provide a platform for the scientific community to directly compare and integrate data generated across the consortium projects. The development of an open-source and extensible platform to train and share deep learning models will guarantee high standards in many image analysis workflows and additionally reduce the amount of annotated data necessary for training supervised deep learning algorithms. For example, our initial efforts involve integrating the most commonly used deep learning image analysis tools to make the initial GUI more flexible for model training and inference. Some of the considered tools include StarDist [221], Noise2Void [222] (image denoising), CellPose [223] (cell segmentation), and Elektronn3 [224] (EM data segmentation).
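As a sketch of the kind of workflow such a platform would wrap, the snippet below applies a pretrained StarDist model to a single image; the input filename is hypothetical, and the model and normalization settings follow the published StarDist examples rather than our production configuration.

```python
# Minimal sketch: instance segmentation of one image with a pretrained
# StarDist model. The input filename is hypothetical.
from skimage.io import imread
from csbdeep.utils import normalize
from stardist.models import StarDist2D

img = imread("nuclei.tif")
model = StarDist2D.from_pretrained("2D_versatile_fluo")  # published demo model
labels, details = model.predict_instances(normalize(img, 1, 99.8))
```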

Data dissemination
Public neuroscience repositories are rapidly being developed, and a lot of progress has been made in this direction. Several data repository options can be found in online resources [225, 226]. It is worth searching for both general-purpose repositories, such as Zenodo [227], and domain-specific repositories. For example, for repositories on chronic pain, we use OpenPain [228], the Pain and Interoception Imaging Network repository [229, 230], and ENIGMA [231]. In the context of bioimaging data storage and sharing, the EMBL-EBI BioImage Archive [231] is a large-scale, centralized data resource that hosts reference imaging data. The OpenfMRI project [232], which was originally created for the free and open sharing of raw MRI datasets (old datasets available [233]), has since expanded to include datasets from other neuroimaging modalities such as MEG, EEG, and PET and has been renamed the OpenNeuro project [234, 235]. Certain repositories require datasets to be submitted in a standardized format; for example, OpenNeuro (which accepts anonymized human-derived datasets) and OMEGA (the Open MEG Archive, exclusively for MEG data) have adopted the BIDS format, as has the MNE-BIDS tool, which links BIDS-formatted data with the MNE-Python analysis package for MEG and EEG data. In addition to providing basic features such as data hosting and support for metadata files, there are repositories that provide restricted data sharing and anonymization services, which are highly suitable for publishing datasets from clinical projects. The Cancer Imaging Archive (TCIA) and the LONI Image Data Archive are 2 examples.
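To illustrate BIDS-compliant submission, a minimal sketch using MNE-BIDS follows; the raw file name, subject label, and task name are hypothetical.

```python
# Minimal sketch: converting one EEG recording to BIDS with MNE-BIDS.
# File name, subject, and task label are hypothetical.
import mne
from mne_bids import BIDSPath, write_raw_bids

raw = mne.io.read_raw_brainvision("sub01_pain_task.vhdr")
bids_path = BIDSPath(subject="01", task="pain", datatype="eeg",
                     root="bids_dataset")
write_raw_bids(raw, bids_path=bids_path, overwrite=True)
```

A dataset organized this way can be checked with the BIDS validator before upload to a BIDS-aware repository such as OpenNeuro.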
Other neuroscience-focused data repositories with specific purposes include G-Node GIN for datasets derived from both human and nonhuman organisms, BrainLife (human neuroimaging) [236], the Distributed Archives for Neurophysiology Data Integration [237], and the Fenix-backed EBRAINS [238]. EBRAINS, produced by the European Union-funded Human Brain Project (HBP), is an open European digital research infrastructure that provides one of the most complete platforms for sharing brain research data of various types and across spatial and temporal scales. The HBP EBRAINS data curation team can assist with data submission and integration and can provide defined embargo periods to allow for progressive disclosure.
Aside from these domain-specific repositories, numerous well-known open data repositories, as well as sharing and management platforms, accept data from a wide range of disciplines. Zenodo [239] and Dryad, for example, are online archives that manage research datasets with metadata and allow long-term data access via persistent identifiers. Figshare [240, 241] is a free commercial data repository with features such as custom storage options, version control, visualization, metadata customization, and DOI-based data curation. The EMBL SourceData SmartFigure [242] focuses on the scientific figure as a sharing unit, combining data sharing and visualization. The Harvard Dataverse Network is both a platform for institutions and a data repository implemented on FAIR data principles to publish, share, reference, extract, and analyze research data. Consortia offering support and access to cloud computing, such as OpenScienceGrid, JetstreamCloud, Fenix, and the European Commission-backed European Open Science Cloud, can support analysis if institutional solutions are not available.
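As an example of programmatic deposition in a general-purpose repository, the sketch below creates a Zenodo deposition and uploads one file through Zenodo's REST API; the file name is illustrative, and a personal access token is assumed to be available in the environment.

```python
# Minimal sketch: creating a Zenodo deposition and uploading one file via
# the Zenodo REST API. The file name is illustrative; ZENODO_TOKEN must
# hold a personal access token.
import os
import requests

params = {"access_token": os.environ["ZENODO_TOKEN"]}
deposition = requests.post("https://zenodo.org/api/deposit/depositions",
                           params=params, json={}).json()

# Upload the file into the deposition's file bucket
bucket = deposition["links"]["bucket"]
with open("dataset.zip", "rb") as fh:
    requests.put(f"{bucket}/dataset.zip", data=fh, params=params)
```

Metadata (title, creators, license) can then be added and the deposition published, which mints a persistent DOI for the dataset.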
In addition to contributing datasets to repositories, sharing other relevant information such as experimental protocols, source code, and research software used for processing or analyzing these datasets is essential for reproducing research findings [243]. This approach can complement existing efforts to standardize complete datasets and to facilitate contributions to repositories. However, when dealing with complex projects that involve sensitive data, data sharing can become more difficult. In such cases, additional measures such as obtaining consent forms, implementing access regulations, and using anonymization strategies may be required to protect data confidentiality.
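One simple building block of such anonymization strategies is pseudonymization of subject identifiers, sketched below with a salted one-way hash; the salt handling is deliberately simplified, and a real clinical project would need a vetted pipeline covering all direct and indirect identifiers.

```python
# Minimal sketch: pseudonymizing subject identifiers with a salted
# one-way hash. The salt value is a placeholder; store it securely and
# never publish it. Clinical data need a vetted anonymization pipeline.
import hashlib

SALT = b"project-specific-secret"  # placeholder value

def pseudonymize(subject_id: str) -> str:
    digest = hashlib.sha256(SALT + subject_id.encode()).hexdigest()
    return f"sub-{digest[:12]}"

print(pseudonymize("patient_0042"))  # stable pseudonym for the same input
```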
For the long-term preservation of data, researchers need permanent archiving systems, along with sufficient funds to build such archives both for internal usage and to satisfy the open data requirements of journals and funding organizations. Ideally, archival systems should be developed from the user's perspective, especially in scientific settings where researchers with limited expertise in digital preservation collaborate on projects that generate a wide range of data [244].
Researchers often encounter challenges in determining best practices for archiving and in gaining access to archival systems. They require assistance and support to ensure that their research data are appropriately archived and accessible to other researchers. First, researchers need guidance on best practices for archiving their research data, including how to stay informed about the storage, retention, and disposal of all research data. This is especially crucial because good archival practice requires a scheduled review of data in long-term storage, and researchers need to remain informed about these practices whether their data are stored in an institutional or an external repository. In addition, researchers need support to ensure that their data handling complies with various regulations and guidelines, including discipline-specific privacy and ethical standards, copyright or licensing arrangements, and publication and legal requirements. Meeting these regulations and guidelines is crucial to ensure that researchers' data remain both protected and accessible. To address these challenges, research institutions and organizations can provide researchers with the necessary training, resources, and infrastructure for archiving their data. This includes developing guidelines and policies on data management and sharing, providing access to data repositories and storage facilities, and offering training and support on data management and archiving best practices.
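A concrete ingredient of such archiving practice is a fixity check: the sketch below writes a SHA-256 manifest for an archive directory that can be re-verified during scheduled reviews. The directory layout and manifest name are illustrative.

```python
# Minimal sketch: writing a SHA-256 manifest for an archive directory so
# its integrity can be re-verified at scheduled reviews. Paths are
# illustrative; very large files would be hashed in chunks rather than
# read into memory at once.
import hashlib
from pathlib import Path

def write_manifest(archive_dir: str, manifest: str = "MANIFEST.sha256") -> None:
    root = Path(archive_dir)
    with open(root / manifest, "w") as out:
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.name != manifest:
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                out.write(f"{digest}  {path.relative_to(root)}\n")

write_manifest("archive/pain-study-2020")
```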
The period for which data should be preserved for research purposes or archiving should be determined by prevailing standards for the specific research domain and should follow the retention policies of any applicable stakeholders (e.g., sponsoring institution, funding agency). For example, in the context of our consortium funded by the DFG, primary research data should be appropriately archived in the researcher's own institution or an appropriate nationwide infrastructure for at least 10 years [162].

Conclusion
We have presented a data management strategy that we developed and put into practice within the framework of a collaborative research center encompassing both basic and clinical research on humans and animals. To foster FAIR and open science, this strategy strives to offer practical solutions for multimodal and multidisciplinary research. The strategy is composed of adaptive and incremental phases: planning, implementation, and dissemination. Consistent communication with consortium project members during the planning and implementation phases was crucial to identify the most helpful RDM measures. We spent a considerable amount of time learning about publicly accessible tools, services, and new developments in the RDM field that could be beneficial to our consortium. We believe that this knowledge might also be useful to other researchers.
In the planning phase, we evaluated common data management practices across projects. We categorized projects based on the typical population studied and the common measurement methods used. We focused on addressing issues such as metadata management, documentation of experimental protocols, preprocessing and analysis pipelines, data storage and data volume, data sharing, data dissemination, data archiving, and sensitive data-related issues that arise when working with highly diverse and heterogeneous data. Complexity was raised further by the major RDM challenges encountered in tandem projects that work with both human and animal populations, including data and metadata standardization, the integration of different data types, and the harmonization of datasets and analysis workflows.
In the implementation phase, we presented innovative solutions based on preexisting and customized tools, developed for flexible and incremental data management with a focus on research collaborations. We discussed the implementation of project-specific data management plans, structured based on data acquisition, processing, and analysis methods across the CRC 1158 projects. Relatively simple measures, such as offering ELN options for documenting experimental protocols, tutorials on HPC resource access, and regular data seminars on basic RDM tools (such as data versioning tools and code and workflow management software), can improve data management practices in noticeable ways. We focused on the development of new tools for metadata organization and management depending upon the requirements of each project and the type of data collected. In animal projects, we assisted with migration from proprietary formats and supported experimental annotation and organization. Moreover, for large datasets, we provided easy access to software and tools through web-based applications to enable interactive analysis and visualization.
For human projects, we adopted standard protocols to associate various data types with the respective metadata. In the case of MRI, we standardized MR acquisition protocols, data organization, and preprocessing pipelines. Behavioral, sensory testing, and psychological questionnaires were standardized in collaboration with a service project of CRC 1158.
The CRC 1158 emphasizes that active communication and engagement with general and domain-specific RDM community initiatives are required for the development of RDM strategies for any large-scale research consortium. Modern research infrastructure and technological advancements, such as web-based technologies for sharing data and analysis tools, provide opportunities to increase the reproducibility of research outcomes in both basic and translational neuroscience.
Further development of this RDM model with more specialized technical infrastructure is envisioned for the next period of the consortium. Multisite and multispecies projects require a federated data-sharing approach, which will allow data held on the computer systems of geographically distributed participating labs to be integrated without moving the data to a centralized location.

CRC 1158 data management policies and funding information
CRCs (SFB, short for the German "Sonderforschungsbereich") are university research projects that are funded by the DFG, generally for a period of up to 12 years [245]. The Heidelberg Pain Consortium [246] is a collaborative research center (CRC 1158) composed of 44 principal investigators in Germany aiming to understand the mechanisms of pain and pain chronicity in order to identify causal links and possible therapeutic interventions. It involves a multidisciplinary team of scientists and clinicians working on 23 different projects, including 12 animal projects, 6 tandem (human-animal) projects, 4 human projects, and 1 central administrative project. CRC 1158 additionally includes 2 service projects: the first (a tandem project) aims at establishing standard protocols, models, and ethical standards to facilitate homogeneous implementation across all human or rodent projects; the second (an animal project) aims at developing simplified systems to accelerate the analysis of the translational potential of acquired research insights.
The DFG has supported CRC 1158 since 2015 [247]. In June 2019, CRC 1158 was successfully renewed and received funding for another 4 years (2019-2023) under DFG project number 255156212. CRC 1158 has many national collaborations (e.g., the University of Heidelberg, the Central Institute of Mental Health, the European Molecular Biology Laboratory, and the German Cancer Research Center) and international collaborations (institutions located in the United States, Canada, England, and France).
The Heidelberg Pain Consortium implemented a development strategy in a central administration project (Z01) [248] to promote RDM as an integral part of the research process in order to maximize the impact of collaborative science. By implementing this strategy, the consortium was able to take a systematic and standards-based approach to documenting, archiving, and sharing its research data with collaborators and the research community, with the goal of significantly accelerating scientific progress. This RDM model is expected to evolve in response to the development of new and specialized (domain-specific) technical infrastructures.
As the CRC host institution, Heidelberg University offers comprehensive recommendations for the administration of research data [249]. The CRC's data policy (available in Supplementary Data 1) describes the variety of RDM services offered to researchers to ensure that RDM for each project adheres to the DFG guidelines [250].
These services include aid with proper data documentation, the integration and support of open data management solutions, data storage and accessibility, the development of new tools for the adoption of open data and metadata standards, the sharing of diverse datasets within the consortium and with external collaborators, and the dissemination of research outcomes into national and international data repositories. The policy is applicable to all researchers working in the CRC, including PIs, doctoral and postdoctoral researchers, and student research assistants. It also applies to any research project carried out within the CRC, as well as to any data generated or shared with outside sources.

Data availability
All supporting data are available via the GigaScience repository, GigaDB [251].